6

I'd like to search indented code for lines that don't start with a pound sign (#), ignoring leading whitespace.

Currently, I'm using the regex ^\s*([^\s#].*) with multiline option on.

On non-commented lines it works perfectly.

On commented lines, however, the regex engine backtracks through \s* all the way from the comment sign to the start of the line, which can sometimes take 40 or 50 backtracking steps.

The regex gives correct results on Python code. It's just not very efficient because of the backtracking.

Any idea how to avoid it?


Bonus: It's rather funny that the regex engine doesn't recognize that it's searching for [^\s] one character at a time inside \s*, which causes this amount of backtracking. What are the challenges in making the re engine recognize this?

Bonus 2: I can only use the stdlib re module, as I cannot add third-party packages. (I'm technically searching in Sublime Text, but I want to know how to do it in Python in general.)

smci
Bharel

2 Answers

6

Use the atomic nature of lookarounds to avoid backtracking:

^(?=(\s*))\1([^#].*)
    ^^^^^  ^

@vks's answer simplifies this nicely into a negative lookahead.
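For the stdlib-only case, the lookahead-plus-backreference pattern can be tried in Python roughly like this. It is a minimal sketch: the sample `source` text and the variable names are made up for illustration. Note that the code text ends up in group 2, because group 1 holds the whitespace captured inside the lookahead.

```python
import re

# The lookahead captures the leading whitespace without consuming it;
# \1 then consumes exactly that text. Once the lookahead has finished,
# the engine does not go back into it to try a shorter \s* match.
ATOMIC_EMULATED = re.compile(r"^(?=(\s*))\1([^#].*)", re.MULTILINE)

# Hypothetical sample input, joined without a trailing newline.
source = "\n".join([
    "def f():",
    "    # a comment",
    "    return 1",
    "# top-level comment",
    "x = f()",
])

# Group 2 is the line content after the leading whitespace.
code_lines = [m.group(2) for m in ATOMIC_EMULATED.finditer(source)]
print(code_lines)
```

With this sample input, the two comment lines are skipped and the three code lines are returned without their indentation.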

or use possessive quantifiers with the third-party regex module:

^\s*+([^#].*)

or even atomic groups:

^(?>\s*)([^#].*)

Sublime Text supports all three, since its search uses a Perl-compatible regex engine.
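As a side note for readers on newer interpreters: since Python 3.11 the standard `re` module also accepts possessive quantifiers and atomic groups, so both patterns above can be tried without the third-party module. The sketch below assumes nothing beyond that; the sample `source` and the variable names are illustrative, and on older versions the patterns are simply not compiled.

```python
import re
import sys

# Hypothetical sample input: one comment line, one code line.
source = "    # comment\n    value = 1"

if sys.version_info >= (3, 11):
    # Possessive quantifier: \s*+ never gives back what it matched.
    possessive = re.compile(r"^\s*+([^#].*)", re.MULTILINE)
    # Atomic group: same effect, written as (?>...).
    atomic = re.compile(r"^(?>\s*)([^#].*)", re.MULTILINE)
    possessive_hits = [m.group(1) for m in possessive.finditer(source)]
    atomic_hits = [m.group(1) for m in atomic.finditer(source)]
else:
    # Older stdlib re rejects this syntax, so skip the demo there.
    possessive_hits = atomic_hits = None

print(possessive_hits, atomic_hits)
```

On Python versions before 3.11, the lookahead-with-backreference form remains the stdlib-only option.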

As for the bonus part: no, it's not funny. If you look closely, you'll see the pattern is not [^\s], which is exactly \S, but something slightly different: [^\s#]. For the engine this means there are two different paths to consider at each step, so it backtracks to find one.

revo
  • Say wha? Is this a bug or a feature? Why doesn't the re engine cause backtracking on the lookarounds? Is this even documented? I mean, according to regex buddy this... Works... – Bharel Feb 04 '18 at 18:24
  • It's the way it works. Check [Lookaround Is Atomic](https://www.regular-expressions.info/lookaround.html). – revo Feb 04 '18 at 18:29
  • Last thing, regarding making the re engine recognize that `\S` can't be in `\s*`: even if it's combined with a pound sign, there is still no chance for this combo to be inside `\s`. I'm not proficient in how the internals of an engine work, but is it possible to somehow compile it so the engine will understand that it isn't possible and won't waste its time? It sounds like eliminating these options at the compilation stage is probably possible (not saying it's easy, but possible) – Bharel Feb 04 '18 at 19:19
  • There are some engine-specific pre-scan optimizations that study the pattern before the matching process starts, or even during it. For example, with `\s*\S` the engine doesn't try to backtrack into `\s*` to match a non-whitespace character; it matches `\s*` possessively. But when there are multiple paths to take, the engine has no way of knowing without backtracking, hence your case. – revo Feb 04 '18 at 19:27
4

You can simply say

^(?!\s*#).*

This takes just 6 steps in comparison to 33 steps taken by yours.
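A minimal Python sketch of the same negative lookahead, using only the stdlib `re` module. The sample `source` and the variable names are assumptions for illustration. One difference from the original pattern is worth keeping in mind: here the whole line is the match, so the indentation is included, and since `.*` can match nothing, blank lines may be matched as well.

```python
import re

# Reject a line when optional whitespace followed by '#' starts it;
# otherwise match the whole line, indentation included.
NOT_COMMENT = re.compile(r"^(?!\s*#).*", re.MULTILINE)

# Hypothetical sample input, joined without a trailing newline.
source = "\n".join([
    "def f():",
    "    # a comment",
    "    return 1",
    "# top-level comment",
    "x = f()",
])

# group(0) is the full matched line.
code_lines = [m.group(0) for m in NOT_COMMENT.finditer(source)]
print(code_lines)
```

If the indentation is not wanted, calling `.strip()` on each match is one simple way to drop it.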

vks
  • Alright, I understand @revo's answer, regarding not backtracking back into lookarounds after leaving them. Why doesn't this backtrack **inside** the lookahead? Backtracks inside lookaheads are possible after all... – Bharel Feb 04 '18 at 19:37
  • I marked his only because of the explanation btw. Yours is also an awesome answer and I'll thank you a lot if you'll explain how it works :-) – Bharel Feb 04 '18 at 19:43
  • He's probably talking about [this kind of backtrack](https://regex101.com/r/Ym5qGj/1). @WiktorStribiżew – revo Feb 04 '18 at 19:44
  • @revo I think I got it. Does the engine evaluate the negative (`!`) only after exiting the atomic part, instead of before? That is `not(match())` instead of `match(not())`? – Bharel Feb 04 '18 at 19:52
  • @Bharel The engine backtracks when it cannot fulfil a condition. In this case my regex fulfils the first condition, so why would the engine backtrack? It might backtrack when it can't find #, as then the condition is not fulfilled. – vks Feb 04 '18 at 19:55
  • Awww... Well, if so, this backtracks on any non comment line like crazy which now turned my problem worse *snif snif* :-( – Bharel Feb 04 '18 at 20:03
  • No it doesn't since there is no path other than `#` next to whitespaces. Engine's smart enough. @Bharel – revo Feb 04 '18 at 20:07
  • @revo Erm... According to regex buddy it isn't :-( Maybe the pre-scan optimization isn't done there or in other regex checkers? – Bharel Feb 04 '18 at 20:09
  • Then I doubt RegexBuddy applies any engine optimizations. @Bharel – revo Feb 04 '18 at 20:15
  • @revo well then. I guess regexbuddy is no longer my buddy. Thanks Revo! – Bharel Feb 04 '18 at 20:21
  • @Bharel this regex performs better in both scenarios – vks Feb 04 '18 at 20:36