
I have a regex in Java that matches zero or more whitespace characters, followed by a comma, followed by zero or more whitespace characters. To clean up the Strings I use the following:

String new1 = test1.replaceAll("\\s*,\\s*", ",");

If I understand correctly, this regex will backtrack if there is no comma in a given string. Say we have String s = "Hello    World"; (4 spaces between the words, no comma), then:

  1. The regex will start by trying to match the first \\s*, which succeeds by matching the four spaces after "Hello".
  2. Then it tries to match the comma, which fails because there is no comma.
  3. The regex backtracks to the first \\s* and decides to match three spaces, leaving one space for the comma. It tries to match the comma again, which fails because there is still no comma.
  4. It repeats this process, giving back one more space each time, until it has exhausted all possibilities. In the end, the match fails as there is no comma in the string.
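
The four steps above can be sketched as a small hand-written loop. This is only a toy model of greedy `\s*,` at a single start offset, not the actual java.util.regex implementation; the class and method names (`GreedyTrace`, `commaAttempts`) are made up for illustration, and `Character.isWhitespace` is used as a rough stand-in for `\s`.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of greedy \s*, at ONE start offset; not java.util.regex internals.
class GreedyTrace {

    // Returns the indices at which the comma is tested, in the order they are tried.
    static List<Integer> commaAttempts(String s, int start) {
        List<Integer> tried = new ArrayList<>();
        int end = start;
        // Step 1: \s* greedily consumes the whole whitespace run.
        while (end < s.length() && Character.isWhitespace(s.charAt(end))) {
            end++;
        }
        // Steps 2-4: test the comma, then give back one character at a time.
        for (int pos = end; pos >= start; pos--) {
            tried.add(pos);
            if (pos < s.length() && s.charAt(pos) == ',') {
                break; // a comma here would end the search
            }
        }
        return tried;
    }

    public static void main(String[] args) {
        String s = "Hello    World"; // 4 spaces, no comma
        // Start at index 5, the first space after "Hello".
        System.out.println(commaAttempts(s, 5));
    }
}
```

For a run of r whitespace characters and no comma, this model tests the comma r + 1 times before giving up at that offset.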

I'm looking for a version of this pattern that will prevent the backtracking. My idea was to use atomic groups, i.e.:

String new1 = test1.replaceAll("(?>\\s*),(?>\\s*)", ",");
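
Before comparing speed, it may be worth confirming that the variants are interchangeable on ordinary input. The sketch below is a minimal check, with a made-up class name (`CommaCleanup`) and sample string; it also includes the possessive spelling `\s*+`, which java.util.regex supports as another way to forbid giving characters back.

```java
// Compares three spellings of the pattern on input that does contain commas.
class CommaCleanup {

    // Plain greedy quantifiers (the original pattern).
    static String greedy(String s) {
        return s.replaceAll("\\s*,\\s*", ",");
    }

    // Atomic groups around each whitespace run.
    static String atomic(String s) {
        return s.replaceAll("(?>\\s*),(?>\\s*)", ",");
    }

    // Possessive quantifiers.
    static String possessive(String s) {
        return s.replaceAll("\\s*+,\\s*+", ",");
    }

    public static void main(String[] args) {
        String in = "a , b,  c"; // hypothetical sample input
        System.out.println(greedy(in));
        System.out.println(atomic(in));
        System.out.println(possessive(in));
    }
}
```

Since `\s` can never match a comma, no backtracking into the whitespace run can ever help the comma match, so all three variants should produce the same output.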

To test the approach, I wrote a simple JMH benchmark using a String that has 50000 spaces between Hello and World, and no comma. On average, the approach with the atomic groups is 2x faster. This is of course better, but still unsatisfactory.

Bench result (see below for setup):

Benchmark                    Mode  Cnt  Score   Error  Units
RegexBench.atomicGroup      thrpt       0,387          ops/s
RegexBench.replaceAll       thrpt       0,162          ops/s
RegexBench.replaceExtended  thrpt       0,248          ops/s

EDIT: Oddly enough, according to the https://regex101.com/ debugger, the pattern \s*,\s* needs 86 steps to decide that there is no match, whereas the one that uses atomic groups needs 172. By that measure the former should be faster, so I'm perplexed why the latter is quicker in Java.
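
One possible way to reason about the Java numbers is a toy step counter. It assumes the engine retries the pattern at every start offset of a comma-free input, so each offset inside the whitespace run re-scans the rest of the run. This is an assumption for illustration only: it is not how java.util.regex or regex101 count steps, and all names (`StartOffsetSteps`, `greedySteps`, `atomicSteps`, `spaced`) are hypothetical.

```java
// Toy step counter for comma-free input; NOT java.util.regex or regex101 accounting.
class StartOffsetSteps {

    // Length of the whitespace run beginning at index i.
    static int runLength(String s, int i) {
        int j = i;
        while (j < s.length() && Character.isWhitespace(s.charAt(j))) {
            j++;
        }
        return j - i;
    }

    // Greedy: consume r chars, then test the comma r + 1 times while backing off.
    static long greedySteps(String s) {
        long steps = 0;
        for (int i = 0; i < s.length(); i++) {
            int r = runLength(s, i);
            steps += r + (r + 1);
        }
        return steps;
    }

    // Atomic: consume r chars, test the comma once, never back off.
    static long atomicSteps(String s) {
        long steps = 0;
        for (int i = 0; i < s.length(); i++) {
            steps += runLength(s, i) + 1;
        }
        return steps;
    }

    // Builds "hello" + n spaces + "world", like the benchmark input.
    static String spaced(int n) {
        StringBuilder sb = new StringBuilder("hello");
        for (int i = 0; i < n; i++) {
            sb.append(' ');
        }
        return sb.append("world").toString();
    }

    public static void main(String[] args) {
        for (int n : new int[] {1000, 2000, 4000}) {
            String s = spaced(n);
            System.out.println(n + " spaces: greedy=" + greedySteps(s)
                    + " atomic=" + atomicSteps(s));
        }
    }
}
```

In this model both counts grow quadratically with the length of the whitespace run, and the atomic variant does roughly half the work of the greedy one. That would be consistent with a speedup of about 2x that still leaves the overall cost high, but it is a model, not a measurement of the real engine.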


Here's the bench setup:

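import java.io.IOException;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)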
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 1)
@Measurement(iterations = 1)
@Fork(1)
public class RegexBench {

    final static String s1 = divideTwoWordsByNumberOfSpaces("hello", "world", 50000);

    public static void main(final String[] args) throws IOException {
        org.openjdk.jmh.Main.main(args);
    }

    static String divideTwoWordsByNumberOfSpaces(final String word1, final String word2, final int numberOfSpaces) {
        final StringBuilder sb = new StringBuilder();
        sb.append(word1);
        for (int i = 0; i < numberOfSpaces; i++) {
            sb.append(" ");
        }
        sb.append(word2);
        return sb.toString();
    }

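    // Original pattern with plain greedy quantifiers.
    @Benchmark
    public void replaceAll(final Blackhole blackhole) {
        // Hand the result to the Blackhole so the JIT cannot discard the call.
        blackhole.consume(s1.replaceAll("\\s*,\\s*", ","));
    }

    // Possessive quantifiers.
    @Benchmark
    public void replaceExtended(final Blackhole blackhole) {
        blackhole.consume(s1.replaceAll("\\s*+,\\s*+", ","));
    }

    // Atomic groups.
    @Benchmark
    public void atomicGroup(final Blackhole blackhole) {
        blackhole.consume(s1.replaceAll("(?>\\s*),(?>\\s*)", ","));
    }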
}

And these dependencies in pom.xml:

    <!-- https://mvnrepository.com/artifact/org.openjdk.jmh/jmh-core -->
    <dependency>
      <groupId>org.openjdk.jmh</groupId>
      <artifactId>jmh-core</artifactId>
      <version>1.36</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.openjdk.jmh/jmh-generator-annprocess -->
    <dependency>
      <groupId>org.openjdk.jmh</groupId>
      <artifactId>jmh-generator-annprocess</artifactId>
      <version>1.36</version>
    </dependency>
SkogensKonung
  • I don't see why `test1.replaceAll("\\s*,\\s*", ",")` would result in backtracking. The regex engine should scan for either whitespace or a comma. If it finds whitespace then continue to scan for either more whitespace or a comma. If it then finds something else other than whitespace or a comma, there is no reason to backtrack. It should continue from that point looking for the next whitespace or comma. So if there is no comma in the string, it will just scan all the characters once. In other words, it should be easy to construct a DFSA that recognizes `"\\s*,\\s*"`. – Booboo May 23 '23 at 20:12
  • Here is a Microsoft article on backtracking in .NET. [Backtracking in .NET Regular Expressions | Microsoft Learn](https://learn.microsoft.com/en-us/dotnet/standard/base-types/backtracking-in-regular-expressions). – Reilas May 25 '23 at 04:45
  • Have you reviewed any of the code from [Matcher.java](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/regex/Matcher.java)? – Reilas May 25 '23 at 04:47
