
I'm trying to recover two positions using Java regular expressions.

The first one is given by the regex:

val r="""(?=(?<=[ ]|^)[^ ]{1,21474836}(?=[ ]|$)(?<=[^A-Z]|^)[A-Z]{1,21474836}(?=[^A-Z]|$))""".r

The second one is given by the regex:

val p="""(?<=(?<=[ ]|^)[^ ]{1,21474836}(?=[ ]|$)(?<=[^A-Z]|^)[A-Z]{1,21474836}(?=[^A-Z]|$))""".r

Note that the two expressions are identical, except that the first "=" of the first expression is replaced by "<=" in the second, which turns the outer lookahead into a lookbehind. I am not using nested quantifiers here.

My command to test it is the following:

r.findAllMatchIn("a <b/>"*100) //.... some long string of size 600...
p.findAllMatchIn("a <b/>"*100) //.... some long string of size 600...

The first example runs almost instantly, whereas the second takes tens of seconds. If I launch the same examples in a REPL, both are very fast.

Where does that come from? How can I make the second expression faster?

Update: Why this matters

Note that in general, I can have expressions of the type:

[^ ]+[^.]+

and I would like to know at which positions a match of this regular expression can end, that is, where it can be found immediately to the left of a given position. If I have the following data, with the character positions below it:

abc145A
0123456

I would like the end of the previous expression to match at positions 1, 2, 3, 4, 5 and 6. If I use non-greedy (reluctant) quantifiers, it matches at 1, 3 and 5. If I use greedy quantifiers, it matches only at 6. This is why I need look-behind assertions, unless someone can show me another way to find the positions I am looking for.
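
To make these positions concrete, here is a minimal sketch that reproduces the greedy and non-greedy results on the sample data. The object and method names (`GreedyVsReluctant`, `lastIndices`) are only illustrative, and a position is taken to be the index of the last character of a match.

```scala
import scala.util.matching.Regex

object GreedyVsReluctant {
  // Index of the last character of each (non-overlapping) match
  // reported by findAllMatchIn. The name is illustrative only.
  def lastIndices(regex: Regex, input: String): List[Int] =
    regex.findAllMatchIn(input).map(m => m.end - 1).toList

  def main(args: Array[String]): Unit = {
    val data = "abc145A"
    // Reluctant quantifiers stop as early as possible.
    println(lastIndices("""[^ ]+?[^.]+?""".r, data)) // List(1, 3, 5)
    // Greedy quantifiers consume the whole string in one match.
    println(lastIndices("""[^ ]+[^.]+""".r, data))   // List(6)
  }
}
```

Neither variant reports all of 1 to 6, because `findAllMatchIn` resumes searching only after the end of the previous match.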

Mikaël Mayer
  • I'm guessing the double lookbehind is causing it to loop over the same characters repeatedly - taking O(n^2) time instead of O(n) time. – Brilliand Mar 10 '14 at 16:09
  • Another possibility: {1,21474836} is a really, really big range, and the time taken by the lookbehind that contains it might be proportional to the size of that range. – Brilliand Mar 10 '14 at 16:17
  • I tried to lower the number to 2000 but it does not change anything. – Mikaël Mayer Mar 10 '14 at 16:28
  • What if you reduce it to 10? (Java might be reducing it to the length of the string you're searching through automatically.) – Brilliand Mar 10 '14 at 16:36
  • Reducing it to 10 improves the speed by a factor of 4. – Mikaël Mayer Mar 10 '14 at 16:53
  • Can your results include spaces anywhere? It looks to me like those regexes could almost be replaced with a call to `String.split(" ")`. – Brilliand Mar 10 '14 at 17:10
  • The first Regex can be read as "matches where there is a space token followed by an uppercase word" – Mikaël Mayer Mar 10 '14 at 17:37
  • I can't get the given regex to match anything. It's probably because of the `(?=[ ]|$)` and `[A-Z]` parts, which basically say that the next character must be a space, but must also be a capital letter. – Brilliand Mar 10 '14 at 19:55

1 Answer


You aren't using nested quantifiers, but I suspect nested lookbehinds cause a similar problem. I suspect you don't need that outer lookahead/lookbehind at all - how about performing a single regex search using only the inner part of the regexes (common to both), and retrieving both the start position and the end position from each result?
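
A minimal sketch of that idea follows. The pattern is a simplified, hypothetical stand-in (runs of uppercase letters bounded by non-uppercase characters), not the exact inner expression from the question, and the names `MatchBounds`, `upperRun` and `bounds` are made up for illustration.

```scala
import scala.util.matching.Regex

object MatchBounds {
  // Assumption: a simplified stand-in for the inner expression.
  // It matches a run of uppercase letters that is not adjacent
  // to another uppercase letter on either side.
  val upperRun: Regex = """(?<=[^A-Z]|^)[A-Z]+(?=[^A-Z]|$)""".r

  // One pass over the input; each match yields both positions:
  // start (inclusive) and end (exclusive).
  def bounds(input: String): List[(Int, Int)] =
    upperRun.findAllMatchIn(input).map(m => (m.start, m.end)).toList

  def main(args: Array[String]): Unit =
    println(bounds("a <B/> CD")) // List((3,4), (7,9))
}
```

Because the outer lookahead and lookbehind are gone, the engine only scans forward through the input once.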

Brilliand
  • I already tried this approach, and it does not work in all cases. If I use a single expression, it may match tokens that are too long and hide another occurrence of the token. – Mikaël Mayer Mar 10 '14 at 16:19
  • You could get around that by searching for one match at a time in a loop, rather than searching for all matches with one method call. Alternatively, you could put a group in the first regex (put parentheses around "[^ ]{1,21474836}"), and use its length to determine where the end position is. – Brilliand Mar 10 '14 at 16:27
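
The one-match-at-a-time loop from the last comment could look roughly like the sketch below. It assumes `java.util.regex.Matcher.find(int)` is used to restart the search one character after the previous match's start, so that a long match does not hide later ones. The names `OverlappingMatches` and `allMatches` are hypothetical.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ListBuffer

object OverlappingMatches {
  // Collects (start, end) pairs, with end exclusive. After each match the
  // search resumes at start + 1 instead of at the end of the match.
  def allMatches(pattern: Pattern, input: String): List[(Int, Int)] = {
    val m = pattern.matcher(input)
    val found = ListBuffer.empty[(Int, Int)]
    var from = 0
    // find(int) resets the matcher and searches from the given index;
    // the guard avoids an IndexOutOfBoundsException past the end.
    while (from <= input.length && m.find(from)) {
      found += ((m.start(), m.end()))
      from = m.start() + 1
    }
    found.toList
  }

  def main(args: Array[String]): Unit = {
    val pattern = Pattern.compile("[^ ]+[^.]+")
    println(allMatches(pattern, "abc145A"))
  }
}
```

This reports every position where a match can start, but only the greedy end for each start. Enumerating every possible end position would still need extra work, such as the capture-group idea from the same comment.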