11

The code below contains a regular expression designed to extract a C# string literal but the performance of the regex matching for input strings of more than a few characters is woeful.

class Program
{
   private static void StringMatch(string s)
    {
        // regex: quote, zero-or-more-(zero-or-more-non-backslash-quote, optional-backslash-anychar), quote
        Match m = Regex.Match(s, "\"(([^\\\\\"]*)(\\\\.)?)*\"");
        if (m.Success)
            Trace.WriteLine(m.Value);
        else
            Trace.WriteLine("no match");
    }

    public static void Main()
    {
        // this first string is unterminated (so the match fails), but it returns instantly
        StringMatch("\"OK");

        // this string is terminated (the match succeeds)
        StringMatch("\"This is a longer terminated string - it matches and returns instantly\"");

        // this string is unterminated (so the match will fail), but it never returns
        StringMatch("\"This is another unterminated string and takes FOREVER to match");
    }
}

I can refactor the regex into a different form, but can anyone offer an explanation why the performance is so bad?

Alan Moore
  • 73,866
  • 12
  • 100
  • 156
adelphus
  • 10,116
  • 5
  • 36
  • 46
  • http://msdn.microsoft.com/en-us/magazine/ff646973.aspx – SLaks Mar 13 '12 at 16:02
  • I think it is wrong. `[^\"]` won't stop at `\"`. It will stop at `\` or at `"`. So it will stop at the `\` of `\n`. Is it right? – xanatos Mar 13 '12 at 16:04
  • 1
    Maybe you could modify your regex if you're not using the backreferences. `"\"(?:(?:[^\\\"]*)(?:\\.)?)*\""`. Of course if you ARE using the backreferences, then ignore this. – Matthew Mar 13 '12 at 16:11

6 Answers6

14

You're running into catastrophic backtracking:

Let's simplify the regex a bit (without the escaped quotes and without the second optional group because, as in your comment, it's irrelevant for the tested strings):

"(([^\\"]*))*" 

([^\\"]*) matches any string except quotes or backslashes. This again is enclosed in an optional group that can repeat any number of times.

Now for the string "ABC, the regex engine needs to try the following permutations:

  • ", ABC
  • ", ABC, <empty string>
  • ", AB, C
  • ", AB, C, <empty string>
  • ", AB, <empty string>, C
  • ", AB, <empty string>, C, <empty string>
  • ", <empty string>, AB, C
  • ", <empty string>, AB, C, <empty string>
  • ", <empty string>, AB, <empty string>, C, <empty string>
  • ", <empty string>, AB, <empty string>, C
  • ", A, BC
  • ", A, BC, <empty string>
  • ", A, <empty string>, BC
  • ", <empty string>, A, BC
  • etc.
  • ", A, B, C
  • ", A, B, C, <empty string>
  • ", A, B, <empty string>, C
  • etc. etc.

each of which then fails because there is no following ".

Also, you're only testing for substrings instead of forcing the regex to match the entire string. And you usually want to use verbatim strings for regexes to cut down on the number of backslashes you need. How about this:

foundMatch = Regex.IsMatch(subjectString, 
    @"\A     # Start of the string
    ""       # Match a quote
    (?:      # Either match...
     \\.     # an escaped character
    |        # or
     [^\\""] # any character except backslash or quote
    )*       # any number of times
    ""       # Match a quote
    \Z       # End of the string", 
    RegexOptions.IgnorePatternWhitespace);
Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
  • Your answer makes a valid point, but your permutation example is a poor man's regex matcher. I would expect any decent implementation to identify the locations of known/constant/literal characters first before attempting permutations of optional groups. After all, what's the point in attempting to match an optional group if the required literal characters don't exist?! – adelphus Mar 13 '12 at 16:45
  • 1
    @adelphus: The example may be contrived a bit, and I guess the .NET engine would indeed detect the immediately nested repetitions and optimize them away. But in your regex, it can't do this because there is the other (optional) `(\\\\.)?` group present that I dropped in my simplified example and which would have been attempted to match in the position marked as ``. As for the required literals, I doubt if there's a regex engine that will do that. Especially not if they are not anchored to fixed positions in the string. The .NET regex engine is one of the most sophisticated ones. – Tim Pietzcker Mar 13 '12 at 17:24
  • RegexBuddy has a nice feature which will detect possible performance issues with your expressions http://www.regexbuddy.com/debug.html – jessehouwing Mar 13 '12 at 18:38
2

EDIT

Here you go: "\"([^\\\\\"]|\\\\.)*\""

To explain, after C# has escaped the string you get this regex: "([^\\"]|\\.)*"

Meaning:

"                #start with a quote
(
    [^\\"]       #match a non backslash or quote
    |            #or
    \\.          #backslash something
)                
*                #And repeat
"                #end with a quote

By not nesting your * you don't get the exponential or infinite loop, and it returns instantly for me.

Buh Buh
  • 7,443
  • 1
  • 34
  • 61
  • This is true. The same problem occurs in the excluded chars group. – adelphus Mar 13 '12 at 16:36
  • OK cool, could you edit your question to fix this issue and then let us know if you still have these problems? – Buh Buh Mar 13 '12 at 16:45
  • I've corrected the code and, yes, the issue still exists. Thanks for the heads up. – adelphus Mar 13 '12 at 16:54
  • I do like the simplicity of your version and I will probably use it. I have to give the correct answer to Tim though since the question was "Why was the performance so bad" rather than "Give me a better regex". Thanks for your help though - much appreciated. – adelphus Mar 13 '12 at 17:20
1

Try

Match m = Regex.Match(s, @"'.*?(?<=[^\\](\\\\)*)'".Replace("'", "\""));

This will "intelligently" ignore even number of \. This because " closes a string, \" doesn't, \\" does (because the first backslash escapes the second one), \\\" doesn't...

.*? is a lazy quantifier. You can even use the standard .* quantifier. I'll say that perhaps you should anchor your regex with ^ and $.

I'm using the Replace because escaping double quotes gave me headaches :-)

I'll add that with a 4k text it is instantaneous on my computer, both in match and don't match.

As an alternative:

Match m = Regex.Match(s, @"^'(?>([^'\\]|\\.)*)'$".Replace("'", "\""));

Explanation:

(?> ) disables backtracking

^ begin of the string

then you have an alternating construct (0 or more times, the *):

[^'\\] any non-quote and non backslash

\\. or a backslash followed by another character (that is escaped)

$ end of the string

This too is very very fast, and it's quite easy to read.

xanatos
  • 109,618
  • 12
  • 197
  • 280
  • +1 Sometimes, putting the independent sub-expression construct (?> ) around too much, doesn't limit backtracking within that sub-expression, I think it limits it with regard to expressions outside of it. –  Mar 13 '12 at 18:19
1

I think @Tim Pietzcker gave the best explanation about the backtracking.

Through various benchmarks around (my own included) these are the fast ways:

Method 1, unrolling

" [^"\\]* (?: \\. [^"\\]* )* "

Method 2, alternation

" (?: \\. | [^"\\]+ )* "  

Method 1, can outperform 2 by substantial margins.

info

I think its really hard to explain catastrophic backtracking. Even recognising it is sometimes hard unless is it timewise very evident. Then, in time-critical applications it is sometimes beneficial to do some benchmarks.

On this quoting subject, I like to add new approaches to a benchmark templated perl (5.10 engined) script to see how it does. Each engine is a little different. If you care, here is a sample.

Regex samples vs. time using a heavily quoted and escaped string
itterated 100,000 times each.

(?x-ism:" ( (?: \\?. )*? ) ")
the code took:14.7031 wallclock secs (14.58 usr + 0.00 sys = 14.58 CPU)

(?x-ism:" (.*? (?<!\\) (?:\\{2})* ) ")
the code took:12.8435 wallclock secs (12.75 usr + 0.00 sys = 12.75 CPU)

(?x-ism:" ( (?: [^\\"] | \\. )* ) ")
the code took:10.3123 wallclock secs (10.27 usr + 0.00 sys = 10.27 CPU)

(?x-ism: " ( (?: [^"\\]+ | (?:\\.)+ )* ) " )
the code took:8.39063 wallclock secs ( 8.39 usr + 0.00 sys = 8.39 CPU)

(?x-ism: " ( (?: [^"\\]+ | \\. )* ) " )
the code took:8.7498 wallclock secs ( 8.75 usr + 0.00 sys = 8.75 CPU)

(?x-ism: " ( (?: \\. | [^"\\]+ )* ) " )
the code took:8.5623 wallclock secs ( 8.44 usr + 0.00 sys = 8.44 CPU)

(?x-ism: " ( [^"\\]* (?: \\. [^"\\]* )* ) " )
the code took:7.79661 wallclock secs ( 7.80 usr + 0.00 sys = 7.80 CPU)

(?x-ism: (?> " ( (?: [^"\\] | \\. )* " ) ) )
the code took:10.5156 wallclock secs (10.52 usr + 0.00 sys = 10.52 CPU)

1

Here's what I would use:

"[^\n"\\]*(?:\\.[^\n"\\]*)*"

@sln is correct about the unrolled-loop approach being fastest, but I would refine it a bit more by excluding linefeeds, which are not allowed in old-fashioned string literals. The \\. part is okay, but [^"\\] needs to be changed to [^\n"\\]. Also, if we're talking about extracting string literals, we can't anchor the regex with \A and \Z.

I used RegexBuddy to compare the performance of your regex, Tim's regex without the anchors, and this one. I placed the cursor before the opening quote in each of your sample strings and used "Debug Here", and these are the results:

original regex        :  "(([^\\"\n]*)(\\.)?)*"

"OK                   :  failed in 101 steps

"This is a longer...  :  matched in 12 steps

"This is another...   :  gave up after 1,000,000 steps



Tim's regex           :   "(?:\\.|[^\\"\n])*"

"OK                   :  failed in 17 steps

"This is a longer...  :  matched in 211 steps

"This is another...   :  failed in 253 steps


unrolled loop         :  "[^\\"\n]*(?:\\.[^\\"\n]*)*"

"OK                   :  failed in 5 steps

"This is a longer...  :  matched in 5 steps

"This is another...   :  failed in 5 steps

Plugging that into your code as a verbatim string, you would get:

Match m = Regex.Match(s, @"""[^\n""\\]*(?:\\.[^\n""\\]*)*""");

EDIT: By the way, I'm not saying you must use this regex because it's faster; the other solutions are almost certainly fast enough. But if you do need maximum performance (while still using regex), this is probably the way to achieve it. What makes it so fast is that the regex always moves forward: no backreferences, no lookarounds, and most importantly, no backtracking.

Alan Moore
  • 73,866
  • 12
  • 100
  • 156
0

The backtracking issue is caused because everything is optionally quantified
inside nested quantifier groups. This block is surrounded by literal's.
Since the final literal is never reached, the engine try's each internal
sequence infinitely.

The only way around it is to put a literal stopping point inside the block.
This is the point that stops the backtracking.

Beyond that, you can put the literal inside two variable terms and use a clever
cluster to get the best result:

"  [^\\"]*  (?:  \\.    [^\\"]*  )*  "
     ^^^         ^^^      ^^^
     var       literal    var
sln
  • 2,071
  • 1
  • 3
  • 11