Very slow look-behind

Question

I'm trying to recover two positions using Java regex.

The first one is given by the regex:

val r = """(?=(?<=[ ]|^)[^ ]{1,21474836}(?=[ ]|$)(?<=[^A-Z]|^)[A-Z]{1,21474836}(?=[^A-Z]|$))""".r

The second one is given by the regex:

val p = """(?<=(?<=[ ]|^)[^ ]{1,21474836}(?=[ ]|$)(?<=[^A-Z]|^)[A-Z]{1,21474836}(?=[^A-Z]|$))""".r

Note that the two expressions are identical, except that the first "=" is replaced by "<=" in the second expression. I am not using nested quantifiers here.

My command to test it is the following:

r.findAllMatchIn("a <b/>"*100) //.... some long string of size 600...
p.findAllMatchIn("a <b/>"*100) //.... some long string of size 600...

The first example is almost instant during execution, whereas the second takes dozens of seconds. If I launch the same examples in a REPL, both are very fast.

Where does that come from? How can I make the second expression faster?

Update: Why this matters

Note that in general, I can have expressions of the type:

[^ ]+[^.]+

and I would like to know when this regular expression can be found on the left of a given position, or when it can end. If I have the following data with the position below it:

abc145A
0123456

I would like the end of the previous expression to match positions 1, 2, 3, 4, 5 and 6. If I use non-greedy quantifiers, it matches 1, 3 and 5. If I use greedy quantifiers, it matches only 6. This is why I need look-behind assertions, unless you can show me another way to find the positions I am looking for.
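To pin down the positions wanted here, below is a brute-force sketch that avoids look-behind entirely: for every candidate end, it checks whether some substring ending there matches the whole expression. The names `EndPositions` and `lastCharIndices` are made up for this illustration, and the quadratic loop is only meant to define the expected result, not to be fast.

```scala
import java.util.regex.Pattern

object EndPositions {
  /** Index of the last character of every substring of `text`
    * that `regex` matches in full (hypothetical helper). */
  def lastCharIndices(regex: String, text: String): Seq[Int] = {
    val m = Pattern.compile(regex).matcher(text)
    for {
      stop <- 1 to text.length // exclusive end offset of the candidate substring
      // region(start, stop) restricts the matcher; matches() must consume the whole region
      if (0 until stop).exists(start => m.region(start, stop).matches())
    } yield stop - 1
  }
}
```

With the data above, `EndPositions.lastCharIndices("[^ ]+[^.]+", "abc145A")` yields 1, 2, 3, 4, 5 and 6: position 0 is excluded because the expression needs at least two characters.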

I'm guessing the double lookbehind is causing it to loop over the same characters repeatedly - taking O(n^2) time instead of O(n) time. — Brilliand, Mar 10 '14 at 16:09
Another possibility: {1,21474836} is a really, really big range, and the time taken by the lookbehind that contains it might be proportional to the size of that range. — Brilliand, Mar 10 '14 at 16:17
I tried to lower the number to 2000 but it does not change anything. — Mikaël Mayer, Mar 10 '14 at 16:28
What if you reduce it to 10? (Java might be reducing it to the length of the string you're searching through automatically.) — Brilliand, Mar 10 '14 at 16:36
Can your results include spaces anywhere? It looks to me like those regexes could almost be replaced with a call to `String.split(" ")`. — Brilliand, Mar 10 '14 at 17:10
The first Regex can be read as "matches where there is a space token followed by an uppercase word" — Mikaël Mayer, Mar 10 '14 at 17:37
I can't get the given regex to match anything. It's probably because of the `(?=[ ]|$)` and `[A-Z]` parts, which basically say that the next character must be a space, but must also be a capital letter. — Brilliand, Mar 10 '14 at 19:55
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/49423/discussion-between-brilliand-and-mikael-mayer) — Brilliand, Mar 10 '14 at 20:11

score 1 · Answer 1 · answered Mar 10 '14 at 16:14

You aren't using nested quantifiers, but I suspect nested lookbehinds cause a similar problem. I suspect you don't need that outer lookahead/lookbehind at all - how about performing a single regex search using only the inner part of the regexes (common to both), and retrieving both the start position and the end position from each result?
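A minimal sketch of that idea, reading both offsets from one search. The pattern here is a simplified stand-in (an assumption, not the question's full expression): a maximal run of uppercase letters. `TokenBounds` and `bounds` are hypothetical names.

```scala
import scala.util.matching.Regex

object TokenBounds {
  // Stand-in pattern (assumption): uppercase run delimited by non-uppercase or string edges.
  val inner: Regex = """(?<=[^A-Z]|^)[A-Z]+(?=[^A-Z]|$)""".r

  /** (start, end) offsets of each match; `end` is exclusive. */
  def bounds(text: String): List[(Int, Int)] =
    inner.findAllMatchIn(text).map(m => (m.start, m.end)).toList
}
```

For example, `TokenBounds.bounds("ab CD eFG")` gives `List((3,5), (7,9))`, so the start positions and the end positions both come out of a single pass with no outer lookahead or lookbehind.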

Brilliand


I already tried this approach, and it does not work in all cases. Indeed, if I use a single expression, it may match tokens that are too long and that hide another occurrence of the token – Mikaël Mayer Mar 10 '14 at 16:19
You could get around that by searching for one match at a time in a loop, rather than searching for all matches with one method call. Alternatively, you could put a group in the first regex (put parentheses around "[^ ]{1,21474836}"), and use its length to determine where the end position is. – Brilliand Mar 10 '14 at 16:27
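One possible reading of the "one match at a time" suggestion, as a sketch: restart the search one character after the start of the previous match, so a long match cannot hide the matches that begin inside it. `OverlappingMatches` and `all` are hypothetical names; only `Pattern.compile`, `Matcher.find(int)`, `start()` and `end()` come from `java.util.regex`.

```scala
import java.util.regex.Pattern

object OverlappingMatches {
  /** (start, end) offsets of the first match found from each restart point; `end` is exclusive. */
  def all(regex: String, text: String): List[(Int, Int)] = {
    val m = Pattern.compile(regex).matcher(text)
    val out = List.newBuilder[(Int, Int)]
    var from = 0
    // find(from) resets the matcher and searches starting at index `from`
    while (from <= text.length && m.find(from)) {
      out += ((m.start(), m.end()))
      from = m.start() + 1 // step past the previous start, not past the whole match
    }
    out.result()
  }
}
```

For instance, `OverlappingMatches.all("[a-z]+", "abc")` returns `List((0,3), (1,3), (2,3))`, whereas a plain find-all would report only `(0,3)`.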
