regex for matching finite-depth nested strings -- slow, crashy behavior

Question

I was writing some regexes in my text editor (Sublime) today in an attempt to quickly find specific segments of source code, and it required getting a little creative because sometimes the function call might contain more function calls. For example I was looking for jQuery selectors:

$("div[class='should_be_using_dot_notation']");

$(escapeJQSelector("[name='crazy{"+getName(object)+"}']"));

I don't consider it unreasonable to expect one of my favorite powertools (regex) to help me do this sort of searching, but it's clear that the expression required to parse the second bit of code there will be somewhat complex as there are two levels of nested parens.

I am sufficiently versed in the theory to know that this sort of parsing is exactly what a context-free grammar parser is for, and that building out a regex is likely to suck up more memory and time (perhaps in an exponential rather than O(n^3) fashion). However I am not expecting to see that sort of feature available in my text editor or web browser any time soon, and I just wanted to squeak by with a big nasty regex.

Starting from this, which matches zero levels of nested parens and does not match a trivial empty pair:

\$\([^)(]+?\)

Here's what the one-level nested parens one I came up with looks like:

\$\(((\([^)(]*\))|[^)(])+?\)

Breaking it down:

\$\(                   begin text
    (                  groups the contents of the $() call
        (\(            groups a level 1 nested pair of parens
            [^)(]*     only accept a valid pair of parens (it shall contain anything but parens)
        \))            close level 1 nesting
        |              contents also can be
        [^)(]          anything else that also is not made of parens
    )+?                not sure if this should be plus or star or if can be greedy (the contents are made up of either a level 1 paren group or any other character)
\)                     end

This worked great! But I need one more level of nesting.
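As a quick sanity check of the two patterns above, here is a minimal JavaScript sketch. The variable names and the `$(escape("div"));` sample are made up for illustration; the other two samples are the selectors shown earlier.

```javascript
// Patterns copied verbatim from the question.
const zeroLevel = /\$\([^)(]+?\)/;
const oneLevel = /\$\(((\([^)(]*\))|[^)(])+?\)/;

// Two selectors from the question, plus a hypothetical call with one nested pair.
const flatCall = `$("div[class='should_be_using_dot_notation']");`;
const oneDeepCall = `$(escape("div"));`;
const twoDeepCall = `$(escapeJQSelector("[name='crazy{"+getName(object)+"}']"));`;

for (const sample of [flatCall, oneDeepCall, twoDeepCall]) {
  const m0 = sample.match(zeroLevel);
  const m1 = sample.match(oneLevel);
  console.log(sample);
  console.log("  zero-level:", m0 ? m0[0] : "no match");
  console.log("  one-level: ", m1 ? m1[0] : "no match");
}
```

The flat call is found by both patterns, the one-deep call only by the one-level pattern, and the two-deep call by neither, which is why one more level is needed.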

I started typing up the two-level nested expression in my editor and it began to pause for 2-3 seconds at a time when I put in *'s.

So I gave up on that and moved to regextester.com, and before very long at all, the entire browser tab was frozen.

My question is two-fold.

1. What's a good way of constructing an arbitrary-level regex? Is this something that only human pattern-recognition can ever hope to achieve? It seems to me that I can get a good deal of intuition for how to go about making the regex capable of matching two levels of nesting based on the similarities between the first two. I think this could just be distilled down into a few "guidelines".
2. Why does regex parsing on non-enormous regexes block or freeze for so long?

I understand that the O(n) linear-time bound is in terms of n, the length of the input the regex runs over (i.e. my test strings). But in a system where it recompiles the regex each time I type a new character into it, what would cause it to freeze up? Is this necessarily a bug in the regex code (I hope not, I thought the JavaScript regex implementation was pretty solid)? Part of my reasoning for moving from my editor to a different regex tester was that I'd no longer be running it (on each keypress) over all ~2000 lines of source code, but it did not prevent the whole environment from locking up as I edited my regex. It would make sense if each character changed in the regex corresponded to some simple transformation in the DFA that represents that expression. But this appears not to be the case. If there are certain exponential time or space consequences to adding a star in a regex, it could explain this super-slow-to-update behavior.

Meanwhile I'll just go work out the next higher nested regexes by hand and copy them into the fields once I'm ready to test them...

How do you make your regex properly deal with parentheses inside string literals, which may occur as (part of) function arguments? — collapsar, Apr 17 '13 at 19:16
It is not dealt with (in my example). They would also get parsed so a string that has an unmatched paren would cause it to fail to match. Could be worked around with an explicit set of matchers for the quotes to recognize stuff as strings, and they can contain anything in them so it shouldn't add too much complexity to the regex itself. — Steven Lu, Apr 17 '13 at 19:19
@StevenLu PCRE supports recursion constructs in regular expressions and .NET has something called "balanced groups". Without one of those two features, nested structures is **the** non-regular language feature that regex cannot deal with. And even with them it gets really messy. You're better off, walking the string character by character and counting nesting levels (i.e. parse it manually, or get a JavaScript parser). — Martin Ender, Apr 17 '13 at 19:21
@m.buettner I agree with all that. The question here is to explore specifically what the regex engine is doing. I want to push it to its limits before resorting to more powerful specialized tools, and those limits do emphatically include *finite* levels of nesting. I also agree that actually doing it the basic way of counting is linear time and is probably the sensible thing to do... — Steven Lu, Apr 17 '13 at 19:23
My first guess for the extra time the *'s are taking would have to do with catastrophic backtracking: http://www.regular-expressions.info/catastrophic.html -- but that's not the only thing that can cause trouble with regex performance. — Kimball Robinson, Apr 17 '13 at 19:26
@steven: Thanks. With regard to your questions, I'd expect the third-from-last line to be responsible, as it effectively means that the sub-regex for the nested group needs to be re-entered after each character. You'd better use `[^)(]*`. The capture group inside the `$(...)` call should be greedy; that should prevent backtracking attempts. — collapsar, Apr 17 '13 at 19:27
@KimballRobinson that might be it. Is probably the answer. Can't believe I forgot about backtracking. — Steven Lu, Apr 17 '13 at 19:28
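
For comparison, the "walk the string character by character and count nesting levels" approach suggested in the comments above can be sketched roughly as follows. This is only an illustration: `findDollarCalls` and `sourceText` are hypothetical names, and like the regexes in the question it does not treat string literals specially.

```javascript
// Find every $( ... ) call by counting parens instead of using a regex.
// String literals are ignored, so an unmatched paren inside a quoted
// string would throw the count off, just as it would for the regexes.
function findDollarCalls(source) {
  const calls = [];
  let searchFrom = 0;
  for (;;) {
    const start = source.indexOf("$(", searchFrom);
    if (start === -1) break;

    let depth = 0;
    let end = -1;
    // start + 1 is the opening paren of the call, so depth becomes 1 there.
    for (let i = start + 1; i < source.length; i++) {
      if (source[i] === "(") {
        depth++;
      } else if (source[i] === ")") {
        depth--;
        if (depth === 0) {
          end = i;
          break;
        }
      }
    }

    if (end === -1) {
      searchFrom = start + 2; // unbalanced call: skip it and keep looking
    } else {
      calls.push(source.slice(start, end + 1));
      searchFrom = end + 1;
    }
  }
  return calls;
}

const sourceText = [
  `$("div[class='should_be_using_dot_notation']");`,
  `$(escapeJQSelector("[name='crazy{"+getName(object)+"}']"));`,
].join("\n");

console.log(findDollarCalls(sourceText));
```

There is no depth limit here, and for balanced input each character is visited once, so there is nothing to backtrack over.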

score 1 · Answer 1 · answered Apr 17 '13 at 19:44

Um. Okay, so nobody wants to write the answer, but basically the answer here is

Backtracking

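It can cause exponential runtime when a pattern lets the same text be matched in many different ways, for example a starred group whose body is itself starred. Making a quantifier non-greedy does not prevent this.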
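
To see the effect on a small scale, here is a hedged sketch. The `ambiguous` pattern is not one of the patterns from the question; it is an assumed example of the kind of intermediate expression one might type while adding stars. The helper name `timeTest` is also made up. Timings are printed rather than asserted, since they depend on the engine and the machine.

```javascript
// Hypothetical intermediate pattern (an assumption, not from the question):
// a starred group whose body is itself starred, so a run of n non-paren
// characters can be split between iterations in exponentially many ways.
const ambiguous = /^([^)(]*)*\)$/;
// Accepts the same strings, but each character can be consumed in only one way.
const unambiguous = /^[^)(]*\)$/;

function timeTest(regex, input) {
  const startedAt = Date.now();
  const matched = regex.test(input);
  return { matched, ms: Date.now() - startedAt };
}

for (const n of [10, 14, 18]) {
  // No closing paren anywhere, so the match must fail, and a backtracking
  // engine has to exhaust every way of splitting the run before giving up.
  const input = "a".repeat(n);
  console.log(n, timeTest(ambiguous, input), timeTest(unambiguous, input));
}
```

On a backtracking engine the work for the ambiguous pattern grows roughly geometrically with `n`, so inputs only a little longer than these are enough to freeze a tab that re-runs the regex on every keypress.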

The answer to the first part of my question:

The two-nested expression is as follows:

\$\(((\(((\([^)(]*\))|[^)(])*\))|[^)(])*\)

The transformation to make the next nested expression is to replace instances of [^)(]* with ((\([^)(]*\))|[^)(])*, or, as a meta-regex (where the replace-with section does not need escaping):

s/\[^\)\(\]\*/((\([^)(]*\))|[^)(])*/

This is conceptually straightforward: In the expression matching N levels of nesting, if we replace the part that forbids more nesting with something that matches one more level of nesting then we get the expression for N+1 levels of nesting!
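
The same substitution can be applied programmatically instead of by hand. A minimal sketch, assuming JavaScript; `nestedBody` and `jqueryCallRegex` are hypothetical helper names, and non-capturing groups are used in place of the capturing ones above, which does not change what is matched.

```javascript
const NON_PAREN = String.raw`[^)(]`;

// Build the part that goes between the outer parens for a given nesting depth.
// Depth 0 is [^)(]*; each further level wraps the previous body in
// (?:\( ... \)|[^)(])*, which is the replacement step described above.
function nestedBody(depth) {
  let body = `${NON_PAREN}*`;
  for (let level = 0; level < depth; level++) {
    body = String.raw`(?:\(${body}\)|${NON_PAREN})*`;
  }
  return body;
}

// Wrap the body in \$\( ... \). Note that with a star at depth 0 this also
// matches an empty $(), unlike the +? version in the question.
function jqueryCallRegex(depth) {
  return new RegExp(String.raw`\$\(` + nestedBody(depth) + String.raw`\)`);
}

const deepCall = `$(escapeJQSelector("[name='crazy{"+getName(object)+"}']"));`;
for (const depth of [0, 1, 2]) {
  const m = deepCall.match(jqueryCallRegex(depth));
  console.log(depth, jqueryCallRegex(depth).source, m ? m[0] : "no match");
}
```

Because the two alternatives at every level start with different characters (an opening paren versus a non-paren), there is only one way to consume each character, so the generated patterns should not trigger the exponential backtracking described above.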

score 1 · Answer 2 · answered Apr 17 '13 at 19:55

To match an arbitrary number of nested (), with only one pair on each level of nesting, you could use the following, changing 2 to whatever number of nested () you require:

/(?:\([^)(]*){2}(?:[^)(]*\)){2}/

To avoid excessive backtracking you want to avoid using nested quantifiers, particularly when the sub-pattern on both sides of an inner alternation is capable of matching the same substring.
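
A small sketch of how the pattern above behaves, next to an alternation-based pattern whose two branches can never match the same substring. The second pattern and the variable names are illustrative assumptions, not part of this answer; the test strings are a simple made-up case and a case with two pairs on the same level.

```javascript
// Pattern from this answer: two opening parens, then two closing ones.
const onePairPerLevel = /(?:\([^)(]*){2}(?:[^)(]*\)){2}/;
// Illustrative alternative: an outer pair containing any mix of non-paren
// characters and flat inner pairs. One branch must start with "(" and the
// other must not, so the branches never compete for the same text.
const siblingsAllowed = /\((?:\([^)(]*\)|[^)(])*\)/;

for (const sample of ["(a(b)c)", "(abc(def)())"]) {
  const strict = sample.match(onePairPerLevel);
  const loose = sample.match(siblingsAllowed);
  console.log(sample);
  console.log("  one pair per level:", strict ? strict[0] : "no match");
  console.log("  siblings allowed:  ", loose ? loose[0] : "no match");
}
```

The first pattern matches `(a(b)c)` but finds nothing in `(abc(def)())`, which is the "only one pair on each level" restriction; the alternation-based pattern matches both strings in full.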

MikeM


This does an okay job. But I do believe it will fail on e.g. `(abc(def)())` – Steven Lu Apr 17 '13 at 19:59
@StevenLu. _"with only one pair on each level of nesting"._ If you want more, see my answer [here](http://stackoverflow.com/questions/15310929/parse-text-file-with-regular-expression/15313305#15313305). – MikeM Apr 17 '13 at 20:02
