
In answering a Splunk question on SO, the following sample text was given:

msg: abc.asia - [2021-08-23T00:27:08.152+0000] "GET /facts?factType=COMMERCIAL&sourceSystem=ADMIN&sourceOwner=ABC&filters=%257B%2522stringMatchFilters%2522:%255B%257B%2522key%2522:%2522BFEESCE((json_data-%253E%253E'isNotSearchable')::boolean,%2520false)%2522,%2522value%2522:%2522false%2522,%2522operator%2522:%2522EQ%2522%257D%255D,%2522multiStringMatchFilters%2522:%255B%257B%2522key%2522:%2522json_data-%253E%253E'id'%2522,%2522values%2522:%255B%25224970111%2522%255D%257D%255D,%2522containmentFilters%2522:%255B%255D,%2522nestedMultiStringMatchFilter%2522:%255B%255D,%2522nestedStringMatchFilters%2522:%255B%255D%257D&sorts=%257B%2522sortOrders%2522:%255B%257B%2522key%2522:%2522id%2522,%2522order%2522:%2522DESC%2522%257D%255D%257D&pagination=null

The person wanted to extract everything in the "filters" portion of the URL if "factType" was "COMMERCIAL"

The following all-in-one regex pulls it out neatly (presuming the URL parts are always in the right order, ie factType coming before filters):

factType=(?<facttype>\w+).+filters=(?<filters>[^\&]+)

According to regex101, it finds its expected matches with 670 steps

But if I break it up to

factType=(?<facttype>\w+)

followed by

filters=(?<filters>[^\&]+)

regex101 reports the matches being found with 26 and 16 steps, respectively

What about breaking up the regex into two makes it so much more (~15x) efficient to match?
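For reference, the two approaches can be compared in Python (a sketch with a shortened stand-in for the log line; Python spells named groups `(?P<name>...)` rather than `(?<name>...)`):

```python
import re

# A shortened stand-in for the sample log line (assumed shape, not the full URL)
msg = ('msg: abc.asia - "GET /facts?factType=COMMERCIAL'
       '&sourceSystem=ADMIN&sourceOwner=ABC'
       '&filters=%257B%2522stringMatchFilters%2522'
       '&sorts=%257B%2522sortOrders%2522&pagination=null')

# All-in-one pattern
combined = re.compile(r'factType=(?P<facttype>\w+).+filters=(?P<filters>[^&]+)')
m = combined.search(msg)

# Broken-up approach: find each field with its own small pattern
m1 = re.search(r'factType=(?P<facttype>\w+)', msg)
m2 = re.search(r'filters=(?P<filters>[^&]+)', msg)

print(m.group('facttype'), m.group('filters'))
print(m1.group('facttype'), m2.group('filters'))
```

Both approaches extract the same two values; only the number of steps the engine takes differs.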

warren
  • @anubhava what named capture group would that go into? Still need two named groups in this case – warren Jul 26 '22 at 20:59
  • In that case `factType=(?<facttype>\w+)|filters=(?<filters>[^\&]+)` – anubhava Jul 26 '22 at 21:02
  • @anubhava - that regex is looking for *either* the first *or* the second. As soon as it finds one, it stops looking. Or'ing isn't going to get me there (at least not that way) ... and still doesn't address the marked differences in efficiencies of chained regexes vs the combined one :) – warren Jul 26 '22 at 21:13
  • I had already told you reason of slowness in my first comment that is use of .+ between 2 patterns – anubhava Jul 27 '22 at 03:17
  • @anubhava - you stated a possible reason. You didn't *explain* it. And I see you've also deleted your previous comments, which now makes what you said even harder to understand for someone coming to the question now – warren Jul 27 '22 at 13:11
  • So "sequential" would be 42 steps in total (16+26). It looks like the "combined" can be optimized to even match in [41 steps](https://regex101.com/r/9u9iT2/2) (the possessive quantifier `*+` is just there to make it fail faster on no match - support for it varies among regex engines). – bobble bubble Jul 27 '22 at 14:13
  • @warren: Deleted comment because I explained it in detail by posting an answer below. – anubhava Jul 27 '22 at 14:25

2 Answers


The main problem with the regexp is the presence of `.+`, where `.` matches (nearly) anything and `+` is greedy by default. A greedy quantifier first consumes as many characters as possible and then backtracks, giving characters back until the rest of the pattern matches; a lazy quantifier consumes characters only when the following pattern fails to match at the current position. Most engines are greedy by default. Fortunately, you can ask for a lazy quantifier with `.+?`. This means the engine will search for the shortest possible match of `.+` instead of the longest one. This is what people usually do when writing a manual search. The result is 65 steps instead of 670 (about 10x better).
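The greedy/lazy distinction can be seen in a small sketch (Python syntax; group names omitted here for brevity):

```python
import re

s = 'factType=COMMERCIAL&a=1&b=2&filters=XYZ&tail=end'

# Greedy: .+ first runs to the end of the string, then gives
# characters back until "filters=" can match.
greedy = re.search(r'factType=(\w+).+filters=([^&]+)', s)

# Lazy: .+? starts as short as possible and grows one character
# at a time, only while "filters=" fails to match where it stands.
lazy = re.search(r'factType=(\w+).+?filters=([^&]+)', s)

print(greedy.groups())  # ('COMMERCIAL', 'XYZ')
print(lazy.groups())    # ('COMMERCIAL', 'XYZ')
```

The results are identical; the difference is purely in how many positions the engine has to try along the way.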

Note that lazy quantifiers do not always help in such a case. It is often better to make the regexp more precise (ie. deterministic) so as to improve performance, by reducing the number of possible backtracks due to wrongly taken paths in the non-deterministic automaton.

Still, note that regexp engines are generally not very optimized compared to manual searches (as long as you use efficient search algorithms). They are great for making code short, flexible and maintainable. For high performance, a basic loop in a native language is often better. This is especially true if it is vectorized using SIMD instructions (which is generally not easy).
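As an illustration of such a "manual search", here is a hypothetical hand-rolled `extract` function using plain substring scans, with no backtracking at all (the function name and structure are this sketch's own, not from the original posts):

```python
def extract(msg):
    """Hypothetical hand-rolled search: plain substring scans, no backtracking."""
    i = msg.find('factType=')
    if i == -1:
        return None
    i += len('factType=')
    j = i
    while j < len(msg) and (msg[j].isalnum() or msg[j] == '_'):
        j += 1  # consume the \w characters of the factType value
    fact_type = msg[i:j]

    k = msg.find('filters=', j)  # start searching after factType
    if k == -1:
        return None
    k += len('filters=')
    end = msg.find('&', k)
    filters = msg[k:end] if end != -1 else msg[k:]
    return fact_type, filters

print(extract('factType=COMMERCIAL&x=1&filters=ABC&y=2'))  # ('COMMERCIAL', 'ABC')
```

Each character is examined at most a constant number of times, which is what the regex engine is fighting to approximate.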

Jérôme Richard
  • `.+? to make it more efficient` Not true. It depends on the distance between the 2 patterns you're trying to match. [Check this demo](https://regex101.com/r/aVWPH1/2) where `.+?` takes more than double the steps of `.+` – anubhava Jul 27 '22 at 04:20
  • @anubhava Yeah, this is why I said "*the engine will search for the shortest possible match of `.+` instead of the longest one*" and "*quantifiers do not always help in such a case*". The best solution is "*to make the regexp more precise (ie. deterministic)*", with emphasis on "more precise". If I understand your answer correctly, this is what your solution does. In fact, it can certainly be made more deterministic (lookaheads make the regexp run in non-linear time) but that makes the regexp hard to write/read (defeating its purpose). – Jérôme Richard Jul 27 '22 at 13:53
  • @anubhava In the end, the best solution is to use a regexp engine that uses a deterministic automaton to guarantee linear time (at least in all cases where it is theoretically possible). Most engines (e.g. PCRE) use NFAs, which are inherently inefficient. A minimized DFA handles such a case very well. The RE2 library (from Google) does this. Sadly, most engines do not do the same. – Jérôme Richard Jul 27 '22 at 13:58

Here is a regex that is inherently more efficient than `.+` or `.+?`, irrespective of the positions of those matches in the input text.

factType=(?<facttype>\w+)(?:&(?!filters=)[^&\s]*)*&filters=(?<filters>[^\&]+)

This regex may look a bit longer, but it will be more efficient because we use a negative lookahead `(?!filters=)` after matching `&` to stop the match just before the `filters` query parameter.

Q. What is backtracking?
A. In simple words: if a match isn't complete, the engine backtracks through the string to try to find a whole match, until it finally succeeds or fails. In the above example, `.+` first matches the longest possible match, till the end of the input, then backtracks one position at a time until the second pattern also matches. With `.+?` the engine does a lazy match instead, moving forward one position at a time to build up the full match.

This suggested approach is far more efficient than the `.*`, `.+` or `.+?` approaches because it avoids expensive backtracking while trying to match the second pattern.

RegEx Details:

  • factType=: Match factType=
  • (?<facttype>\w+): Match 1+ word characters and capture in named group facttype
  • (?:: Start non-capture group
    • &: Match a &
    • (?!filters=): Stop matching when we have filters= at next position
    • [^&\s]*: Match 0 or more of non-space non-& chars
  • )*: End non-capture group. Repeat this group 0 or more times
  • &: Match a &
  • filters=: Match filters=
  • (?<filters>[^\&]+): Match 1 or more non-& chars and capture in named group filters
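The steps above can be exercised in Python, translating the named groups to Python's `(?P<name>...)` syntax (the test string below is this sketch's own shortened stand-in):

```python
import re

# The answer's pattern, with groups in Python's (?P<name>...) syntax
pattern = re.compile(
    r'factType=(?P<facttype>\w+)'   # the factType value
    r'(?:&(?!filters=)[^&\s]*)*'    # skip &-separated params that aren't filters=
    r'&filters=(?P<filters>[^&]+)'  # the filters value
)

m = pattern.search('factType=COMMERCIAL&sourceSystem=ADMIN'
                   '&sourceOwner=ABC&filters=%257B%2522x%2522&sorts=id')
print(m.group('facttype'), m.group('filters'))  # COMMERCIAL %257B%2522x%2522
```

The non-capture group walks parameter by parameter, so the engine never has to give back characters across parameter boundaries the way `.+` does.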

Related article on catastrophic backtracking

anubhava