I have a regex in Java that matches zero or more whitespace characters, followed by a comma, followed by zero or more whitespace characters. To clean up Strings, I use the following:
String new1 = test1.replaceAll("\\s*,\\s*", ",");
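For reference, here is a minimal, self-contained sketch of what this call does on small inputs. The class name CommaCleanup and the helper normalizeCommas are hypothetical names used only for this illustration; the pattern is the one from the line above.

```java
public class CommaCleanup {

    // Hypothetical helper wrapping the replaceAll call from the question.
    static String normalizeCommas(String input) {
        // Whitespace around each comma is removed; the comma itself is kept.
        return input.replaceAll("\\s*,\\s*", ",");
    }

    public static void main(String[] args) {
        // Whitespace on either side of a comma is stripped.
        System.out.println(normalizeCommas("a , b,c ,d"));   // prints a,b,c,d
        // With no comma in the input there is nothing to replace.
        System.out.println(normalizeCommas("Hello World"));  // prints Hello World
    }
}
```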
If I understand correctly, this regex will backtrack if there is no comma in a given string. Say we have String s = "Hello    World"; (4 spaces between words, no comma), then:
- The regex will start by trying to match the first \\s*, which succeeds by matching the four spaces after "Hello".
- Then it tries to match the comma, which fails because there is no comma.
- The regex backtracks to the first \\s* and decides to match three spaces, leaving one space for the comma. It tries to match the comma again, which fails because there is still no comma.
- It continues this process, giving back one more space and trying again, until it has exhausted all possibilities. In the end, the match fails as there is no comma in the string.
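To make the steps above concrete, here is a rough sketch that counts how often the comma would be tried under a deliberately simplified model of a backtracking matcher. It is an assumption-laden illustration, not how java.util.regex is actually implemented: the class BacktrackModel and the method countCommaAttempts are hypothetical, Character.isWhitespace only approximates \s, and the model additionally assumes that the engine repeats the whole procedure from every start position in the input.

```java
public class BacktrackModel {

    // Hypothetical model: counts how often the comma is tried for the pattern \s*,\s*
    // when the greedy \s* gives back one character at a time.
    static long countCommaAttempts(String input) {
        long attempts = 0;
        // Assumption: the matcher is retried from every start position, including the end.
        for (int start = 0; start <= input.length(); start++) {
            // Greedy \s*: consume as much whitespace as possible from this start.
            int end = start;
            while (end < input.length() && Character.isWhitespace(input.charAt(end))) {
                end++;
            }
            // Backtracking: try the comma after the full run, then after one
            // character fewer, and so on down to an empty whitespace match.
            for (int pos = end; pos >= start; pos--) {
                attempts++;
                if (pos < input.length() && input.charAt(pos) == ',') {
                    return attempts; // comma found; the model stops counting here
                }
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        // "Hello", four spaces, "World": the example from the list above.
        String s = "Hello" + " ".repeat(4) + "World";
        System.out.println(countCommaAttempts(s)); // prints 25
    }
}
```

Under this model the count grows roughly with the square of the length of the whitespace run, because every start position inside the run scans the rest of it again.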
I'm looking for a version of this pattern that will prevent the backtracking. My idea was to use atomic groups, i.e.:
String new1 = test1.replaceAll("(?>\\s*),(?>\\s*)", ",");
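As a quick sanity check, the sketch below compares the original pattern with the atomic-group variant and with a possessive-quantifier variant (the one used in the benchmark further down) on a few sample inputs. The class VariantCheck, the constants, and the helper clean are hypothetical names for this illustration; the samples are made up and only check that the output is the same, they say nothing about speed.

```java
import java.util.regex.Pattern;

public class VariantCheck {

    // The three variants discussed in this question, precompiled once.
    static final Pattern GREEDY = Pattern.compile("\\s*,\\s*");
    static final Pattern ATOMIC = Pattern.compile("(?>\\s*),(?>\\s*)");
    static final Pattern POSSESSIVE = Pattern.compile("\\s*+,\\s*+");

    // Hypothetical helper: applies one variant with the same replacement as above.
    static String clean(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll(",");
    }

    public static void main(String[] args) {
        String[] samples = { "a , b,c ,d", "x \t , y", "Hello World" };
        for (String sample : samples) {
            String expected = clean(GREEDY, sample);
            // Both rewrites are expected to give the same result as the original,
            // because a comma can never be part of a whitespace run.
            boolean same = expected.equals(clean(ATOMIC, sample))
                    && expected.equals(clean(POSSESSIVE, sample));
            System.out.println(expected + " -> variants agree: " + same);
        }
    }
}
```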
To test the approach, I wrote a simple JMH benchmark using a String that has 50000 spaces between Hello and World, and no comma. On average, the approach with the atomic groups is 2x faster. This is of course better, but still unsatisfactory.
Bench result (see below for setup):
Benchmark                    Mode   Cnt  Score  Error  Units
RegexBench.atomicGroup       thrpt       0,387         ops/s
RegexBench.replaceAll        thrpt       0,162         ops/s
RegexBench.replaceExtended   thrpt       0,248         ops/s
EDIT: Oddly enough, according to the https://regex101.com/ debugger, the pattern \s*,\s* needs 86 steps to decide that there is no match, whereas the one that uses atomic groups needs 172. Consequently, the former should be faster, so I'm perplexed why the latter is quicker in Java.
Here's the bench setup:
import java.io.IOException;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 1)
@Measurement(iterations = 1)
@Fork(1)
public class RegexBench {

    // Input: "hello", then 50000 spaces, then "world" (no comma anywhere).
    final static String s1 = divideTwoWordsByNumberOfSpaces("hello", "world", 50000);

    public static void main(final String[] args) throws IOException {
        org.openjdk.jmh.Main.main(args);
    }

    static String divideTwoWordsByNumberOfSpaces(final String word1, final String word2, final int numberOfSpaces) {
        final StringBuilder sb = new StringBuilder();
        sb.append(word1);
        for (int i = 0; i < numberOfSpaces; i++) {
            sb.append(" ");
        }
        sb.append(word2);
        return sb.toString();
    }

    // Original pattern with plain greedy quantifiers.
    @Benchmark
    public void replaceAll(final Blackhole blackhole) {
        // Hand the result to the Blackhole so the JIT cannot discard the call.
        blackhole.consume(s1.replaceAll("\\s*,\\s*", ","));
    }

    // Possessive quantifiers (*+).
    @Benchmark
    public void replaceExtended(final Blackhole blackhole) {
        blackhole.consume(s1.replaceAll("\\s*+,\\s*+", ","));
    }

    // Atomic groups (?>...).
    @Benchmark
    public void atomicGroup(final Blackhole blackhole) {
        blackhole.consume(s1.replaceAll("(?>\\s*),(?>\\s*)", ","));
    }
}
And these dependencies in pom.xml:
<!-- https://mvnrepository.com/artifact/org.openjdk.jmh/jmh-core -->
<dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-core</artifactId>
    <version>1.36</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.openjdk.jmh/jmh-generator-annprocess -->
<dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-generator-annprocess</artifactId>
    <version>1.36</version>
</dependency>