8

Further to my previous question: ECMAScript Regex for a multilined string, I have implemented the following loading procedure:

#include <fstream>
#include <future>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

void Load( const std::string& szFileName )
{
     // Matches one "=== NAME === ... === END NAME ===" block; group 1 is the name, group 2 the body.
     static const std::regex regexObject( "=== ([^=]+) ===\\n((?:.|\\n)*)\\n=== END \\1 ===", std::regex_constants::ECMAScript | std::regex_constants::optimize );
     // Matches one "<Key>:Value" entry; the value runs up to the next '<' and may span several lines.
     static const std::regex regexData( "<([^>]+)>:([^<]*)\\n", std::regex_constants::ECMAScript | std::regex_constants::optimize );

     std::ifstream inFile( szFileName );
     inFile.exceptions( std::ifstream::badbit );

     // Read the whole file into memory.
     std::string szFileData( (std::istreambuf_iterator<char>(inFile)), (std::istreambuf_iterator<char>()) );

     inFile.close();

     std::vector<std::future<void>> vecFutures;

     for( std::sregex_iterator itObject( szFileData.cbegin(), szFileData.cend(), regexObject ), end; itObject != end; ++itObject )
     {
          if( (*itObject)[1] == "OBJECT1" )
          {
               vecFutures.emplace_back( std::async( []( std::string szDataString ) {
                    for( std::sregex_iterator itData( szDataString.cbegin(), szDataString.cend(), regexData ), end; itData != end; ++itData )
                    {
                         // Do Stuff
                    }
               }, (*itObject)[2].str() ) );
          }
          else if( (*itObject)[1] == "OBJECT2" )
          {
               vecFutures.emplace_back( std::async( []( std::string szDataString ) {
                    for( std::sregex_iterator itData( szDataString.cbegin(), szDataString.cend(), regexData ), end; itData != end; ++itData )
                    {
                         // Do Stuff
                    }
               }, (*itObject)[2].str() ) );
          }
     }

     // Wait for all parsing tasks to finish (and rethrow any exception they raised).
     for( auto& future : vecFutures )
     {
          future.get();
     }
}

However, loading it with this file results in a Stack Overflow (parameters: 0x00000001, 0x00332FE4):

=== OBJECT2 ===
<Name>:Test Manufacturer
<Supplier>:Test Supplier
<Address>:Test Multiline
Contact
Address
<Email>:test@test.co.uk
<Telephone Number>:0123456789
=== END OBJECT2 ===
=== OBJECT1 ===
<Number>:1
<Name>:Test
<Location>:Here
<Manufacturer>:
<Model Number>:12345
<Serial Number>:54321
<Owner>:Me
<IP Address>:0.0.0.0
=== END OBJECT1 ===

I have been unable to find the source of the Stack Overflow but it looks like the outer std::sregex_iterator loop is responsible.

Thanks in advance!

Thomas Russell
  • Compiler: MSVC 2012 Update 3, OS: Windows 7 x64 – Thomas Russell Jul 07 '13 at 21:38
  • Some similar questions: http://stackoverflow.com/questions/15696435/c-11-regex-stack-overflow-vs2012 and http://stackoverflow.com/questions/12828079/why-does-stdregex-iterator-cause-a-stack-overflow-with-this-data – Mike Vine Jul 07 '13 at 21:52

4 Answers

4

Holy catastrophic backtracking. The culprit is (?:.|\\n)*. Whenever you see a construct like this you know you're asking for trouble.

Why? Because you're telling the engine to match any character (except newline) OR newline, as many times as possible, or none. Let me walk you through it.

The engine will start as expected and match the === OBJECT2 === part without any major issues, a newline will be consumed, and then hell begins. The engine consumes EVERYTHING, all the way down to === END OBJECT1 ===, and backtracks its way from there to a suitable match. Backtracking basically means going back one step and applying the regex again to see if it works, in effect trying every possible way of matching your string. In your case this results in a few hundred thousand attempts, which is probably what is causing your problem.
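
The consume-then-backtrack behaviour described above can be illustrated with a toy matcher. This is a hypothetical model of a naive recursive backtracking engine, not MSVC's actual std::regex implementation; `Stats` and `matchAnyThenSuffix` are names made up for this sketch. It handles only the pattern (?:.|\n)* followed by a literal suffix:

```cpp
#include <cstddef>
#include <string>

// Counters for the toy matcher below (hypothetical, for illustration only).
struct Stats
{
     Stats() : nSteps( 0 ), nMaxDepth( 0 ) {}
     std::size_t nSteps;     // number of positions visited
     std::size_t nMaxDepth;  // deepest recursion level reached
};

// Toy model of matching (?:.|\n)* followed by the literal szSuffix, starting at nPos.
// Greedy: first try to consume one more character (recursing one level deeper);
// only if that fails, back off and test whether the suffix matches at nPos.
bool matchAnyThenSuffix( const std::string& szText, std::size_t nPos, const std::string& szSuffix, Stats& stats, std::size_t nDepth = 0 )
{
     ++stats.nSteps;
     if( nDepth > stats.nMaxDepth )
     {
          stats.nMaxDepth = nDepth;
     }

     // Greedy branch: consume one more character and recurse.
     if( nPos < szText.size() && matchAnyThenSuffix( szText, nPos + 1, szSuffix, stats, nDepth + 1 ) )
     {
          return true;
     }

     // Backtracking branch: does the literal suffix start here?
     return szText.compare( nPos, szSuffix.size(), szSuffix ) == 0;
}
```

In this model the recursion depth reaches the number of characters remaining in the input before any backtracking happens, which hints at why a recursive engine might exhaust the stack on a large file. Whether MSVC's engine behaves this way is an assumption.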

I don't know if your code is any better or if it has any errors in it, but (?:.|\\n)* is the same as writing .* with the s (single-line) modifier (dot matches newlines), or [\S\s]*. If you replace that construct with one of these two, you will hopefully no longer see a stack overflow error.
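
A minimal sketch of that substitution, assuming the rest of the question's pattern stays unchanged. `findObjects` is a hypothetical helper introduced for illustration, and whether this change is enough to avoid the overflow on MSVC 2012 has not been verified here:

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (name, body) for every
// "=== NAME === ... === END NAME ===" block found in szText.
std::vector<std::pair<std::string, std::string>> findObjects( const std::string& szText )
{
     // Same pattern as in the question, with (?:.|\n)* replaced by [\s\S]*.
     static const std::regex regexObject( "=== ([^=]+) ===\\n([\\s\\S]*)\\n=== END \\1 ===", std::regex_constants::ECMAScript );

     std::vector<std::pair<std::string, std::string>> vecObjects;
     for( std::sregex_iterator itObject( szText.cbegin(), szText.cend(), regexObject ), end; itObject != end; ++itObject )
     {
          // Group 1 is the object name, group 2 the raw data between the markers.
          vecObjects.emplace_back( (*itObject)[1].str(), (*itObject)[2].str() );
     }
     return vecObjects;
}
```

Calling `findObjects` on the file contents should yield one entry per object block, which can then be handed to the per-object parsing code.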

Edit: Check out the other solutions too. I did not really have time to go in-depth and provide a solid solution to your problem besides explaining why it's so bad.

Firas Dib
  • +1 for the backtracking explanation. But do you mean replacing `(?:.|\\n)*` with `[\S\s]*`? (Or `.*` if VC++ has "dot matches newline" - standard C++ doesn't to my knowledge). `[\S\s]*` as far as I can tell would have the same amount of backtracking, just not quite as catastrophic, because the regex has fewer steps to do for each backtrack. But I'm very likely to be wrong. :-) – JimmiTh Jul 17 '13 at 18:17
  • @JimmiTh: There will still be a lot of backtracking with the solutions I provided, but nowhere near the amount of the original post. It will be completely manageable and OK. The backtracking will be different because the engine only has to reduce the match by the `[\S\s]` and not take alternations into account. A lazy match would in this case make the engine backtrack even less. – Firas Dib Jul 17 '13 at 20:22
  • I'm not sure I'm buying that explanation. `(?:.|\\n)*` is a little less efficient, but how does it cause [catastrophic backtracking](http://www.regular-expressions.info/catastrophic.html)? Assuming `.` does not match newlines, **there is only one way to match the string**. Backtracking would be linear (on successful or failed matches) - it would have to go back on the `.`s and try to match `\n`, but immediately fail, making it just a little slower than `(?s:.)`. – Kobi Jul 18 '13 at 05:26
  • @Kobi: Because the expression looks like `(.)*`, the quantifier is on the group, not the dot. So you're basically telling the engine to consume one character at a time instead of as much as possible. This means the engine will start at `<` on `:Test Manufacturer` and eat one character at a time all the way to the end (until `=== END OBJECT1 ===`), and then start backtracking from there. Then backtrack until the most recent `\n`, start over, backtrack, etc. – Firas Dib Jul 18 '13 at 07:23
  • @Lindrian - I'd expect `.*` and `(?:.)*` to act exactly the same here. `.*` also backtracks one character at a time. When you are matching `/.*Z/s`, for example, `.*` matches **until the end of the string**, and then backtracks on each character until `Z` can be matched. What you describe sounds like [Possessive Quantifiers](http://www.regular-expressions.info/possessive.html). – Kobi Jul 18 '13 at 07:35
  • @Kobi: Reality is different from what you expect. `.*` and `(.)*` act differently. In this case I meant to write `(.|\n)*` but either way, there is a vast difference. And no, what I'm describing sounds like, if anything, the opposite of Possessive Quantifiers. – Firas Dib Jul 18 '13 at 07:47
4

Here's another attempt:

=== ([^=]+) ===\n((?:(?!===)[^\n]+\n)+)=== END \1 ===

In your C++ it would obviously be written as:

=== ([^=]+) ===\\n((?:(?!===)[^\\n]+\\n)+)=== END \\1 ===

It's made for minimal backtracking (at least when matching), although I'm a bit Mr. Tired-Face at the moment, so I probably missed quite a few ways to improve it.

It makes two assumptions, which are used to avoid a lot of backtracking (that possibly causes the stack overflow, as others have said):

  1. That there's never a === at the start of a line, except for the start/end marker lines.
  2. That C++ supports these regex features - specifically the use of a negative lookahead (?!). It should, considering it's ECMAScript dialect.

Explained:

=== ([^=]+) ===\n

Match and capture the object start marker. The [^=] is one way to avoid a relatively small amount of backtracking here, same as yours - we're not using [^ ], because I do not know if there may be spaces in the OBJECT id.

((?:

Start capturing group for data. Inside it, a non-capturing group, because we're going to match each line individually.

   (?!===)

Negative lookahead - we don't want === at the start of our captured line.

   [^\n]+\n

Matches one line individually.

)+)

Match at least one line between start and end markers, then capture ALL the lines in a single group.

=== END \1 ===

Match the end marker.
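
As a rough sketch, the pattern explained above could be applied like this. `findObjectsLineByLine` is a hypothetical helper name, and the data is assumed to contain no empty lines inside a block, since `[^\n]+` requires at least one character per line:

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (name, body) for every object block,
// matching the body one line at a time instead of with (?:.|\n)*.
std::vector<std::pair<std::string, std::string>> findObjectsLineByLine( const std::string& szText )
{
     // (?!===) rejects marker lines, so the line loop stops at "=== END ... ===".
     static const std::regex regexObject( "=== ([^=]+) ===\\n((?:(?!===)[^\\n]+\\n)+)=== END \\1 ===", std::regex_constants::ECMAScript );

     std::vector<std::pair<std::string, std::string>> vecObjects;
     for( std::sregex_iterator itObject( szText.cbegin(), szText.cend(), regexObject ), end; itObject != end; ++itObject )
     {
          // Group 2 holds all data lines, each still terminated by its '\n'.
          vecObjects.emplace_back( (*itObject)[1].str(), (*itObject)[2].str() );
     }
     return vecObjects;
}
```

Note that, unlike the original pattern, the captured body here keeps the trailing newline of its last line.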

Comparison (using RegexBuddy):

Original version:

  • First match: 1277 steps
  • Failed match: 1 step (this is due to the line break between the objects)
  • Second match: 396 steps

Every added object will cause the number of steps to grow for the previous ones. E.g., adding one more object (copy of object 2, renamed to 3) will result in: 2203 steps, 1322 steps, 425 steps.

This version:

  • First match: 67 steps
  • Failed match: 1 step (once again due to the line break between the objects)
  • Second match: 72 steps
  • Failed match: 1 step
  • Third match: 67 steps
JimmiTh
  • This is a nice way of doing it :). I would suggest changing `[^\n]+` to just `.+` (or `.*` depending on what you need). – Firas Dib Jul 18 '13 at 07:51
  • @Lindrian: Yeah, `[^x]+x` tends to be my default in tired mode, making sure I avoid any unintended greediness. – JimmiTh Jul 18 '13 at 09:02
1

Your expressions appear to be causing a lot of backtracking. I would change your expressions to:

First: ^===\s+(.*?)\s+===[\r\n]+^(.*?)[\r\n]+^===\s+END\s+\1\s+===

Second: ^<([^>]+)>:([^<]*)

Both of these expressions work with the Multiline and DotMatchesAll options. Including the start-of-line anchor ^ limits the backtracking to at most one line or one group.

Ro Yo Mi
  • These 2 regular expressions result in no match being made to the data (i.e. the outer loop terminates instantly). – Thomas Russell Jul 08 '13 at 20:07
  • I've updated the answer with live examples showing how the expressions work. I suspect the problem isn't with the regexs, and lies somewhere in your code. – Ro Yo Mi Jul 09 '13 at 00:47
  • Even a simple example fails to match, though. I would provide a live example, but unfortunately ideone uses GCC, which doesn't currently supply a working `<regex>` implementation. – Thomas Russell Jul 10 '13 at 17:14
0

Try with this pattern instead:

static const std::regex regexObject( "=== (\\S+) ===\\n((?:[^\\n]+|\\n(?!=== END \\1 ===))*)\\n=== END \\1 ===", std::regex_constants::ECMAScript | std::regex_constants::optimize );
Casimir et Hippolyte