One Step Towards Maintainable Regular Expressions In Java
Regular expressions are a great way to extract data from strings.
In Java, we typically use Pattern and Matcher:
String parseDate(String input) {
    Pattern datePattern = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");
    Matcher dateMatcher = datePattern.matcher(input);
    if (!dateMatcher.matches()) {
        throw new IllegalArgumentException("Invalid date format");
    }
    String year = dateMatcher.group(1);
    String month = dateMatcher.group(2);
    String day = dateMatcher.group(3);
    return "Year: " + year + ", Month: " + month + ", Day: " + day;
}
This works, but it’s not great.
There are a few things I don’t like about regular expressions and this example in particular:
The regex itself is already quite cryptic.
Group access via numbers is hard to follow.
Any change to the regex might require me to recount and update group indices.
Honestly, I don’t want to count parentheses every time I tweak something. It’s error-prone!
The good news is, there has been a better alternative since Java 7.
Surprisingly, it’s still rarely used.
And no, it’s not extracting magic numbers into public static final int THREE = 3; to keep Sonar happy.
(If you’re still on Java 6 or below: let’s talk.)
Positional Access Isn’t Just a Regex Problem
This kind of positional access actually reminds me of JDBC. Ever accessed result sets using numeric indices?
try (PreparedStatement ps = con.prepareStatement("SELECT first_name, last_name FROM users")) {
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) { // advance the cursor to a row before reading from it
            String firstName = rs.getString(1);
            String lastName = rs.getString(2);
            // ...
        }
    }
}
Sure, it works. But it’s fragile. One small change in the query, and suddenly your entire mapping is off.
That’s why we typically use column names instead:
try (PreparedStatement ps = con.prepareStatement("SELECT first_name, last_name FROM users")) {
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) { // advance the cursor to a row before reading from it
            String firstName = rs.getString("first_name");
            String lastName = rs.getString("last_name");
            // ...
        }
    }
}
Much better: it’s far more readable and maintainable. If we add or remove columns in the query, we don’t have to touch unrelated lines of code. And bugs like this are easier to spot:
String firstName = rs.getString("last_name");
Enter Named-Capturing Groups
Back to regular expressions. The same principle applies.
Instead of relying on numeric group positions, we can give each group a name.
Java supports this using the (?<name>text) syntax:
String parseDate(String input) {
    Pattern datePattern = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");
    Matcher dateMatcher = datePattern.matcher(input);
    if (!dateMatcher.matches()) {
        throw new IllegalArgumentException("Invalid date format");
    }
    String year = dateMatcher.group("year");
    String month = dateMatcher.group("month");
    String day = dateMatcher.group("day");
    return "Year: " + year + ", Month: " + month + ", Day: " + day;
}
That’s already a huge improvement. The regex is a bit more self-explanatory, and I no longer need to cross-reference group numbers with their meanings.
Now let’s say the year is optional. Here’s what that might look like:
Pattern datePattern = Pattern.compile("((?<year>\\d{4})-)?(?<month>\\d{2})-(?<day>\\d{2})");
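If we had used numbered groups, this change would mean updating all the indices downstream, because the new wrapper group takes index 1 and pushes year, month and day to 2, 3 and 4. With named groups? No change needed. The rest of the code just keeps working (keeping in mind that group("year") returns null when the year is absent).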
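To make this concrete, here is a minimal sketch that runs the optional-year pattern against inputs with and without a year. The class and method names (OptionalYearDemo, describe, groupOne) and the "unknown" fallback are made up for illustration; only the pattern comes from the example above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalYearDemo {

    // Same pattern as above: the year and its trailing dash are optional.
    private static final Pattern DATE = Pattern.compile(
            "((?<year>\\d{4})-)?(?<month>\\d{2})-(?<day>\\d{2})");

    // Named access: unaffected by the extra wrapper group.
    static String describe(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid date format");
        }
        // group("year") is null when the optional part did not take part in the match.
        // The "unknown" fallback is an arbitrary choice for this sketch.
        String year = m.group("year") != null ? m.group("year") : "unknown";
        return "Year: " + year + ", Month: " + m.group("month") + ", Day: " + m.group("day");
    }

    // Positional access: index 1 now points at the wrapper group, not the year.
    static String groupOne(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid date format");
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(describe("2024-05-17")); // Year: 2024, Month: 05, Day: 17
        System.out.println(describe("05-17"));      // Year: unknown, Month: 05, Day: 17
        System.out.println(groupOne("2024-05-17")); // 2024-
    }
}
```

Note how group(1) silently returns "2024-" instead of the year. Code that still relied on the old indices would not fail loudly; it would just produce wrong values.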
(Yes, a non-capturing group like (?:text) would technically help here too, but not in every situation.)
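For comparison, this is roughly what the non-capturing variant could look like with numbered groups. It is only a sketch; NonCapturingDemo and parts are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {

    // (?: ... ) groups the optional "year-" prefix without claiming a group index,
    // so year, month and day keep the indices 1, 2 and 3.
    private static final Pattern DATE = Pattern.compile(
            "(?:(\\d{4})-)?(\\d{2})-(\\d{2})");

    static String[] parts(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid date format");
        }
        // group(1) is null when the optional year is missing.
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        System.out.println(String.join("/", parts("2024-05-17"))); // 2024/05/17
        System.out.println(parts("05-17")[0]);                     // null
    }
}
```

The indices survive this particular change. But as soon as a new capturing group is inserted before an existing one, the numbers shift again, and that is where names pay off.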
Wrapping Up
Positional access works, but it doesn’t scale well.
Whether you’re dealing with SQL or regular expressions, named access is the more maintainable and readable choice.
So next time you write a regex in Java, consider using named-capturing groups. Your future self will thank you.