One Step Towards Maintainable Regular Expressions In Java

Regular expressions are a great way to extract data from strings. In Java, we typically use Pattern and Matcher:

String parseDate(String input) {
    Pattern datePattern = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");
    Matcher dateMatcher = datePattern.matcher(input);
    if (!dateMatcher.matches()) {
        throw new IllegalArgumentException("Invalid date format");
    }

    String year = dateMatcher.group(1);
    String month = dateMatcher.group(2);
    String day = dateMatcher.group(3);

    return "Year: " + year + ", Month: " + month + ", Day: " + day;
}

This works, but it’s not great.

There are a few things I don’t like about regular expressions, and about this example in particular:

  1. The regex itself is already quite cryptic.

  2. Group access via numbers is hard to follow.

  3. Any change to the regex might require me to recount and update group indices.

Honestly, I don’t want to count parentheses every time I tweak something. It’s error-prone!

The good news is that a better alternative has been available since Java 7. Surprisingly, it’s still rarely used. And no, it’s not extracting magic numbers into public static final int THREE = 3; to keep Sonar happy. (If you’re still on Java 6 or below: let’s talk.)

Positional Access Isn’t Just a Regex Problem

This kind of positional access actually reminds me of JDBC. Ever accessed result sets using numeric indices?

try (PreparedStatement ps = con.prepareStatement("SELECT first_name, last_name FROM users")) {
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) { // move the cursor to a row before reading from it
            String firstName = rs.getString(1);
            String lastName = rs.getString(2);
            // ...
        }
    }
}

Sure, it works. But it’s fragile. One small change in the query, and suddenly your entire mapping is off.

That’s why we typically use column names instead:

try (PreparedStatement ps = con.prepareStatement("SELECT first_name, last_name FROM users")) {
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) { // move the cursor to a row before reading from it
            String firstName = rs.getString("first_name");
            String lastName = rs.getString("last_name");
            // ...
        }
    }
}

Much better: it’s way more readable and maintainable. If we add columns to the query or remove them, we don’t have to touch unrelated lines of code. And bugs like this are easier to spot:

String firstName = rs.getString("last_name");

Enter Named-Capturing Groups

Back to regular expressions. The same principle applies.

Instead of relying on numeric group positions, we can give each group a name. Java supports this using the (?<name>text) syntax:

String parseDate(String input) {
    Pattern datePattern = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");
    Matcher dateMatcher = datePattern.matcher(input);
    if (!dateMatcher.matches()) {
        throw new IllegalArgumentException("Invalid date format");
    }

    String year = dateMatcher.group("year");
    String month = dateMatcher.group("month");
    String day = dateMatcher.group("day");

    return "Year: " + year + ", Month: " + month + ", Day: " + day;
}

That’s already a huge improvement. The regex is a bit more self-explanatory, and I no longer need to cross-reference group numbers with their meanings.
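Two details are worth knowing here, and a small sketch shows both. Named groups are still numbered, so existing index-based code keeps working while you migrate. And asking for a name that does not exist in the pattern throws an IllegalArgumentException, so a typo fails loudly instead of silently returning the wrong group. The class and method names below are hypothetical and exist only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupBasics {
    // The date pattern from the article, compiled once and reused.
    static final Pattern DATE =
            Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");

    // True if name-based and index-based lookups return the same text.
    static boolean namedAgreesWithNumbered(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid date format");
        }
        return m.group("year").equals(m.group(1))
                && m.group("month").equals(m.group(2))
                && m.group("day").equals(m.group(3));
    }

    // True if looking up a name that is not in the pattern throws.
    static boolean misspelledNameThrows(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid date format");
        }
        try {
            m.group("yaer"); // deliberate typo
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(namedAgreesWithNumbered("2024-05-17"));
        System.out.println(misspelledNameThrows("2024-05-17"));
    }
}
```

Compare that with a wrong index: group(2) where you meant group(1) still returns something, and nobody notices until the data looks odd.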

Now let’s say the year is optional. Here’s what that might look like:

Pattern datePattern = Pattern.compile("((?<year>\\d{4})-)?(?<month>\\d{2})-(?<day>\\d{2})");

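If we had used numbered groups, this change would mean updating all the indices downstream, because the new outer group becomes group 1 and everything after it shifts by one. With named groups? No lookup needs to change. The only new thing to handle is that group("year") returns null when the year is missing.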
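Here is a minimal sketch of the optional-year pattern in action. The class name, the "unknown" fallback for a missing year, and the firstNumberedGroup helper are my own assumptions for the sake of the example; how you treat a missing year depends on your use case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalYearDemo {
    // Same pattern as above: the year and its trailing dash are optional.
    static final Pattern DATE =
            Pattern.compile("((?<year>\\d{4})-)?(?<month>\\d{2})-(?<day>\\d{2})");

    static String parseDate(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid date format");
        }
        // The lookups are unchanged; only the null case is new.
        String year = m.group("year"); // null when the optional part is absent
        String month = m.group("month");
        String day = m.group("day");
        return "Year: " + (year == null ? "unknown" : year)
                + ", Month: " + month + ", Day: " + day;
    }

    // What index-based code would now see in group 1: the new outer group.
    static String firstNumberedGroup(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid date format");
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(parseDate("2024-05-17"));
        System.out.println(parseDate("05-17"));
        System.out.println(firstNumberedGroup("2024-05-17"));
    }
}
```

The last call is the reason I dislike indices: group(1) used to be the year, and now it is the year plus the dash.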

(Yes, a non-capturing group like (?:text) would technically help here too, but not in every situation.)
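For completeness, this is roughly what the non-capturing variant looks like with plain numbered groups. It is only a sketch, and the class and method names are hypothetical. The (?:...) wrapper groups the year and the dash without claiming an index, so the year, month, and day stay at 1, 2, and 3.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {
    // (?:...) makes the year-plus-dash part optional without adding a group.
    static final Pattern DATE =
            Pattern.compile("(?:(\\d{4})-)?(\\d{2})-(\\d{2})");

    // Returns year, month, and day; the year is null when it is absent.
    static String[] parts(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid date format");
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    // Number of capturing groups in the pattern; the wrapper is not counted.
    static int capturingGroups() {
        return DATE.matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(String.join("/", parts("2024-05-17")));
        System.out.println(capturingGroups());
    }
}
```

This keeps the indices stable for this particular change, but the next edit that adds a real capturing group in front brings the counting problem right back.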

Wrapping Up

Positional access works, but it doesn’t scale well.

Whether you’re dealing with SQL or regular expressions, named access is the more maintainable and readable choice.

So next time you write a regex in Java, consider using named-capturing groups. Your future self will thank you.
