
I have an alternation regex, i.e. (PatternA|PatternB), in which I only want PatternA if PatternB does not exist (PatternB always comes after PatternA but is more important), so I put a negative lookahead in the PatternA branch of the alternation.

This works on shorter text blocks:

https://regex101.com/r/bU6cU6/5

But times out on longer text blocks:

https://regex101.com/r/bU6cU6/2

What I don't understand is that if I put PatternA with the negative lookahead alone against the same long text block, it takes only 32 steps to reject it:

https://regex101.com/r/bU6cU6/3

and if I put PatternB alone in the same long text block it only takes 18 steps to accept it:

https://regex101.com/r/bU6cU6/4

So I am not sure why it takes 100,000+ steps (and times out) to first reject PatternA (32 steps) and then accept PatternB (18 steps) when the two are joined with the pipe. Is there a better way to construct the regex so that it checks PatternA first and then PatternB? Right now it is doing something I don't understand to go from 50 steps to 100k+.

user3649739
  • The lookahead is executed at each location in the document. That is very inefficient. You may move it right after the `Option1:` text: `Option1:(?!.*Option2)\*.*?(?P<Capture>Bob|David|Ted|Alice)|\*Option2 (?P<Capture2>Juan)` – Wiktor Stribiżew Sep 15 '16 at 07:09
  • @WiktorStribiżew That makes sense. In a preceding question http://stackoverflow.com/questions/39482021/fixing-negative-assertion-for-end-of-string I was told to put it at the beginning of the string, accepted that answer, and moved on. Perhaps I should unaccept it if that is in fact not the proper way, as it appears it is not. The above and the accepted answer(s) here both solved this nicely. – user3649739 Sep 15 '16 at 23:44

2 Answers


Unanchored lookarounds used with a "global" regex (one that matches several occurrences) are inefficient, because the lookaround is retried at every position in the text. They should be "anchored" to some concrete context. Often, they are executed at the beginning (lookaheads) or end (lookbehinds) of the string.

In your case, you may "anchor" it by placing it after `Option1:` to ensure it is only executed after `Option1:` has already matched.

Option1:(?!.*Option2)\*.*?(?P<Capture>Bob|David|Ted|Alice)|\*Option2 (?P<Capture2>Juan)
        ^^^^^^^^^^^^^

See this regex demo
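As a rough check of the anchored pattern above, here is a minimal Python sketch. The sample texts and the `pick` helper are made up for illustration, and `re.DOTALL` is an assumption (so that `.` can cross line breaks); the flags used in the original demo are not shown in this thread.

```python
import re

# Pattern from the answer above. re.DOTALL is an assumption so that "." can
# cross newlines; the sample texts below are hypothetical.
PATTERN = re.compile(
    r"Option1:(?!.*Option2)\*.*?(?P<Capture>Bob|David|Ted|Alice)"
    r"|\*Option2 (?P<Capture2>Juan)",
    re.DOTALL,
)

def pick(text):
    """Return the captured name from whichever branch matched, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Only one of the two named groups participates in any given match.
    return m.group("Capture") or m.group("Capture2")

only_a = "intro\nOption1:* some text Bob\nmore text\n"
both = only_a + "filler\n*Option2 Juan\n"

print(pick(only_a))  # the Option1 branch is allowed because Option2 is absent
print(pick(both))    # the lookahead rejects Option1, so the Option2 branch wins
```

The lookahead is only reached once the literal `Option1:` has matched, so at every other position the first branch fails on its first character.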

Some more answers:

> What I don't understand is that if I put PatternA with the negative lookahead alone against the same long text block, it takes only 32 steps to reject it

Yes, but you tested it with internal optimizations ON. Disable them and you will see the step count jump.

> if I put PatternB alone in the same long text block it only takes 18 steps to accept it:

The match is found as expected, in a very efficient way.

Wiktor Stribiżew
  • Okay, I should have looked over your answer before submitting mine. I'm going to leave mine up, but @user, please accept this one. – Alan Moore Sep 15 '16 at 07:40

Your main problem is the position of the lookahead. The lookahead has to be tried at every position, and it has to scan all the remaining characters every time. The longer test string is over 3500 characters long; that adds up.
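To put a number on "that adds up", here is a back-of-the-envelope model in Python. It is only an upper bound on characters the lookahead could scan, not the step count that Regex101 reports, and the `lookahead_work` helper and the sample text are hypothetical.

```python
def lookahead_work(text, anchor="Option1:"):
    """Rough upper bound on characters a `.*`-style lookahead may scan.

    unanchored: the lookahead starts at every index i and can scan the
                len(text) - i characters that remain.
    anchored:   the lookahead starts only right after each occurrence of
                `anchor`, so it scans the remainder once per occurrence.
    """
    n = len(text)
    unanchored = sum(n - i for i in range(n))

    anchored = 0
    start = text.find(anchor)
    while start != -1:
        anchored += n - (start + len(anchor))
        start = text.find(anchor, start + 1)
    return unanchored, anchored

# Hypothetical 3500-character text with a single "Option1:" at the start.
sample = "Option1:*" + "x" * 3491
unanchored, anchored = lookahead_work(sample)
print(len(sample), unanchored, anchored)
```

For a text of n characters the unanchored total is n(n+1)/2, so it grows quadratically, while the anchored total grows with the number of `Option1:` occurrences.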

If your regex isn't anchored, you should always try to start it with something concrete that will fail or succeed quickly; literal text is the best. In this case, it's obvious that you can move the lookahead back: `Option1:\*(?!.*Option2)` instead of `(?!.*Option2)Option1:\*`. (Notice the lack of a trailing `.*` in the lookahead; you didn't need that.)
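A small Python sketch of that rewrite follows. It checks that both placements of the lookahead give the same captures on made-up input, then times them. The full patterns are reconstructed from the fragments quoted in this thread, and `re.DOTALL`, the filler text, and the `captures` helper are assumptions. Python's `re` engine is not the engine Regex101 was running, so treat the timings as indicative only.

```python
import re
import timeit

NAMES = r"(?P<Capture>Bob|David|Ted|Alice)"
OPTION2 = r"\*Option2 (?P<Capture2>Juan)"

# Lookahead first: it is attempted at every position in the text.
LEADING = re.compile(r"(?!.*Option2)Option1:\*.*?" + NAMES + "|" + OPTION2, re.DOTALL)
# Lookahead moved behind the literal: it runs only after "Option1:*" matched.
MOVED = re.compile(r"Option1:\*(?!.*Option2).*?" + NAMES + "|" + OPTION2, re.DOTALL)

def captures(pattern, text):
    """Return (Capture, Capture2) for the first match, or None."""
    m = pattern.search(text)
    return None if m is None else (m.group("Capture"), m.group("Capture2"))

filler = "lorem ipsum dolor sit amet\n" * 130  # about 3500 characters
with_b = "Option1:* pick Bob\n" + filler + "*Option2 Juan\n"
without_b = "Option1:* pick Bob\n" + filler

# Both placements should agree on what is captured.
for text in (with_b, without_b):
    assert captures(LEADING, text) == captures(MOVED, text)

# Timings vary by machine and engine, so no expected numbers are given.
for name, pattern in (("leading", LEADING), ("moved", MOVED)):
    seconds = timeit.timeit(lambda: pattern.search(with_b), number=20)
    print(f"{name}: {seconds:.4f}s for 20 searches")
```

The two forms are interchangeable here because the literal `Option1:*` cannot itself contain `Option2`, so scanning from before or after it finds the same thing.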

But why is PatternA so much quicker when you match it alone? Internal optimizations. When the regex is just `(?!.*Option2.*)Option1:\*.*?(?P<Capture>(Bob|David|Ted|Alice))`, the regex engine can tell that the match must start with `Option1:*`, so it goes straight to that position for its first match attempt. The longer regex is too complicated, and the optimization doesn't occur.

You can test that by using the "regex debugger" option at Regex101, then checking DISABLE INTERNAL ENGINE OPTIMIZATIONS. The step count goes back to over 100,000.

Alan Moore