I was writing some regexes in my text editor (Sublime) today in an attempt to quickly find specific segments of source code, and it required getting a little creative because sometimes a function call contains more function calls. For example, I was looking for jQuery selectors:
$("div[class='should_be_using_dot_notation']");
$(escapeJQSelector("[name='crazy{"+getName(object)+"}']"));
I don't consider it unreasonable to expect one of my favorite powertools (regex) to help me do this sort of searching, but it's clear that the expression required to parse the second bit of code there will be somewhat complex as there are two levels of nested parens.
I am sufficiently versed in the theory to know that this sort of parsing is exactly what a context-free grammar parser is for, and that building out a regex is likely to suck up more memory and time (perhaps exponential rather than O(n^3)). However, I don't expect to see that sort of feature in my text editor or web browser any time soon, and I just wanted to squeak by with a big nasty regex.
I started from this, which matches zero levels of nested parens and rejects a trivially empty $():
\$\([^)(]+?\)
Here's the one-level nested version I came up with:
\$\(((\([^)(]*\))|[^)(])+?\)
Breaking it down:
\$\(      begin text
(         groups the contents of the $() call
(\(       groups a level-1 nested pair of parens
[^)(]*    only accept a valid pair of parens (it may contain anything but parens)
\))       close level-1 nesting
|         contents can also be
[^)(]     any other single character that is not a paren
)+?       not sure whether this should be + or *, or whether it can be greedy (the contents are made up of either a level-1 paren group or any other character)
\)        end
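As a quick sanity check, the two patterns above can be run against some sample lines. This is only a minimal sketch in JavaScript: the names `level0`, `level1`, and `samples` are made up for illustration, the third sample line is a hypothetical one-level case that is not from my source code, and other engines (such as the one in Sublime) may differ in details.

```javascript
// Patterns copied verbatim from above; variable names are illustrative only.
const level0 = /\$\([^)(]+?\)/;
const level1 = /\$\(((\([^)(]*\))|[^)(])+?\)/;

const samples = [
  // No nested parens inside $(...)
  `$("div[class='should_be_using_dot_notation']");`,
  // Two levels of nested parens inside $(...)
  `$(escapeJQSelector("[name='crazy{"+getName(object)+"}']"));`,
  // Hypothetical one-level case, added only for this check
  `$(foo("bar"));`,
];

for (const line of samples) {
  // Print whether each pattern finds a match somewhere in the line.
  console.log(level0.test(line), level1.test(line), line);
}
```

The second sample needs two levels of nesting, so neither pattern should match it, while the one-level sample should be matched by `level1` only.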
This worked great! But I need one more level of nesting.
I started typing up the two-level nested expression in my editor and it began to pause for 2-3 seconds at a time when I put in *'s.
So I gave up on that and moved to regextester.com, and before very long at all, the entire browser tab was frozen.
My question is two-fold.
What's a good way of constructing an arbitrary-level regex? Is this something that only human pattern-recognition can ever hope to achieve? It seems to me that I can get a good deal of intuition for how to go about making the regex capable of matching two levels of nesting based on the similarities between the first two. I think this could just be distilled down into a few "guidelines".
Why does regex parsing on non-enormous regexes block or freeze for so long?
I understand that the O(n) linear-time bound is in terms of n, the length of the input the regex is run over (i.e. my test strings). But in a system that recompiles the regex each time I type a new character into it, what would cause it to freeze up? Is this necessarily a bug in the regex code (I hope not; I thought the JavaScript regex implementation was pretty solid)? Part of my reasoning for moving from my editor to a different regex tester was that I'd no longer be running the regex (on each keypress) over all ~2000 lines of source code, but that did not prevent the whole environment from locking up as I edited my regex. It would make sense if each character changed in the regex corresponded to some simple transformation in the DFA that represents that expression, but this appears not to be the case. If adding a star to a regex has certain exponential time or space consequences, that could explain this super-slow-to-update behavior.
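One possible explanation, offered as an assumption rather than a diagnosis: JavaScript engines typically use a backtracking matcher rather than a DFA, and a half-typed pattern in which a repeated group contains another repetition can make a failed match take time that grows exponentially with the input length. The sketch below times such a pattern against a variant with a single quantifier. Both patterns and the helper `timeTest` are hypothetical, since the two-level attempt itself is not shown above.

```javascript
// Hypothetical intermediate pattern: a repeated group whose body is itself repeated.
const nestedQuantifiers = /\$\((?:[^)(]+)+\)/;
// Accepts the same strings, but each character can be consumed in only one way.
const singleQuantifier = /\$\([^)(]+\)/;

// Run one test and report whether it matched and roughly how long it took.
function timeTest(regex, input) {
  const start = Date.now();
  const matched = regex.test(input);
  return { matched, ms: Date.now() - start };
}

for (const n of [10, 14, 18, 22]) {
  // No closing paren, so a backtracking engine has to exhaust its options before failing.
  const input = "$(" + "a".repeat(n);
  console.log(
    n,
    timeTest(nestedQuantifiers, input).ms + " ms (nested)",
    timeTest(singleQuantifier, input).ms + " ms (single)"
  );
}
```

If the nested timings grow much faster than the single-quantifier ones as n increases, that would be consistent with backtracking cost rather than with regex compilation being slow.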
Meanwhile, I'll just go work out the next higher nested regexes by hand and copy them into the fields once I'm ready to test them...
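The by-hand construction follows a repeating pattern, so it could plausibly be generated instead. Here is a minimal sketch, assuming JavaScript; the helper names `nestedContents` and `jqueryCallRegex` are made up for illustration. Unlike the patterns above, it uses non-capturing groups and a greedy star, so it also accepts an empty `$()`.

```javascript
// Build a pattern for the contents of a paren pair with up to `depth` nested levels.
// Each level is "any non-paren character, or a balanced pair wrapping the previous level".
function nestedContents(depth) {
  let inner = "[^)(]*"; // depth 0: no parens allowed at all
  for (let i = 0; i < depth; i++) {
    inner = "(?:[^)(]|\\(" + inner + "\\))*";
  }
  return inner;
}

// Wrap the contents in the literal "$(" ... ")" of a jQuery call.
function jqueryCallRegex(depth) {
  return new RegExp("\\$\\(" + nestedContents(depth) + "\\)");
}

const sample = `$(escapeJQSelector("[name='crazy{"+getName(object)+"}']"));`;

// One level is not enough for this line; two levels should be.
console.log(jqueryCallRegex(1).test(sample));
console.log(sample.match(jqueryCallRegex(2)));
```

Because a paren can only be consumed by the nested-pair branch and every other character only by the single-character branch, each input character has one way to match, which should keep backtracking modest even as the depth grows.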