
The PCRE regex /..(?<=(.)\1)/ fails to compile: "Subpattern references are not allowed within a lookbehind assertion." Interestingly it seems to be acceptable in lookaheads, like /(?=(.)\1)../, just not in lookbehinds.

Is there a technical reason why backreferences are not allowed in lookbehinds specifically?

Connor Smith
  • Backreferences generally can't be used inside look-behinds. Although, a workaround is possible, `..(?<=(?=(.)\1))` – hwnd Jun 06 '15 at 02:46
  • I'm wondering why, specifically. And it seems ever stranger that even `/..(?<=(.)(?=\1).)/` is accepted, when `/..(?<=(.)\1)/` is not. – Connor Smith Jun 06 '15 at 02:56
  • It's because variable length subpatterns are not allowed in a lookbehind. Since a backreference can have any length, it isn't allowed in a lookbehind either. With pcre, the classical workaround (when possible) is to use the `\K` feature. – Casimir et Hippolyte Jun 06 '15 at 08:49
  • That is not strange at all. `(?<=(.)(?=\1).)` works because the backreference is enclosed in a zero-width assertion (the lookahead), so the length of the subpattern in the lookbehind is constant. – Casimir et Hippolyte Jun 06 '15 at 09:03
  • Note that during the analysis of the pattern, the fact that the subpattern in group 1 has a constant length is totally ignored. (To be clear: backreference = variable length, that's all.) – Casimir et Hippolyte Jun 06 '15 at 09:08
  • Thanks, that explains things. Though presumably it would be possible to check at compile time if the backreference is to a variable-length group? I guess at present it gets rejected before they have that information. – Connor Smith Jun 06 '15 at 11:13
  • @hwnd: `Backreferences generally...` I think that's not valid in general - for example the .net regex engine allows variable length look-behinds and even backreferences in look-behinds. The difference to PCRE is, that .net uses a "right to left" switch internally to implement look-behinds while PCRE (obviously) takes another approach by stepping back n characters and compare. I think there's a similar problem with backreferences... – Wolfgang Kluge Jun 09 '15 at 14:00
  • @WolfgangKluge .NET is one of the few languages that can do that. I haven't heard of any other "common" language that's able to do so. – HamZa Jun 10 '15 at 07:59

1 Answer


With Python's re module, group references are not supported in lookbehind, even if they match strings of some fixed length.


Lookbehinds don't support the full PCRE syntax. Concretely, when the regex engine reaches a lookbehind, it tries to determine its size and then jumps back that far to check the match.

This size determination forces a design choice:

  • allow variable size: the lookbehind's subpattern then has to be executed before the engine knows how far to jump back
  • disallow variable size: the engine can jump back directly

Although the first solution would be the best for us (users), it's obviously the slowest and the hardest to develop. So PCRE settled on the second solution. The Java regex engine, as another example, allows semi-variable lookbehinds: it only needs to be able to determine the maximum size.
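The two strategies above can be sketched as a toy model in Python. This is only an illustration of the trade-off, not how PCRE or any other engine is actually implemented; the helper names `lookbehind_fixed` and `lookbehind_variable` are made up for this example.

```python
import re

def lookbehind_fixed(text, pos, sub_pattern, width):
    """Fixed-size strategy: jump back exactly `width` characters, try once."""
    start = pos - width
    if start < 0:
        return False  # not enough text behind the current position
    return re.compile(sub_pattern).fullmatch(text, start, pos) is not None

def lookbehind_variable(text, pos, sub_pattern):
    """Variable-size strategy: try every start position that ends at `pos`."""
    compiled = re.compile(sub_pattern)
    return any(
        compiled.fullmatch(text, start, pos) is not None
        for start in range(pos, -1, -1)
    )

# Is position 3 of "xaab" preceded by a doubled character?
print(lookbehind_fixed("xaab", 3, r"(.)\1", 2))
# Is position 4 of "xaaab" preceded by one or more "a"?
print(lookbehind_variable("xaaab", 4, r"a+"))
```

The fixed-size version does a single attempt per position, while the variable-size version may retry from every earlier offset, which is the cost the answer refers to.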


I looked at both PCRE and Python's re module.
In the PCRE documentation, I found nothing beyond this error code:

COMPILATION ERROR CODES
25: lookbehind assertion is not fixed length

But in this case, the lookbehind assertion is fixed length.
Now, here is what we can find in the re documentation:

The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Group references are not supported even if they match strings of some fixed length.

There's our culprit. If you want, you can try Python's regex module, which seems to support variable-length lookbehinds.
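The restriction in the standard `re` module can be checked directly. A minimal sketch, assuming a stock CPython `re`; the helper `compiles` is a name made up for this example, and the exact error message differs between Python versions, so only success or failure is reported here.

```python
import re

def compiles(pattern):
    """Return True if `re` accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The pattern from the question: backreference to a group inside the lookbehind.
print(compiles(r"..(?<=(.)\1)"))
# The same idea as a lookahead is accepted.
print(compiles(r"(?=(.)\1).."))
# A variable-length lookbehind is rejected as well.
print(compiles(r"(?<=a*)b"))
# Alternatives of equal length are fine.
print(compiles(r"(?<=abc|abd)e"))

# The lookahead form finds the doubled character.
match = re.search(r"(?=(.)\1)..", "xaab")
print(match.group() if match else None)
```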

zessx