
I'm having an issue with Boost regex and suspect it's a bug, but I figured someone here would know for sure, and whether there's a workaround.

I'm checking the start of a selection for start of string, white-space or an underscore using

(?<=^|\s|_)

However, under Boost this produces an error:

ERROR: Bad regular expression at char 0. Invalid lookbehind assertion encountered in the regular expression.

Without the ^, all is well, and similarly with just the ^ it's fine.

Any help getting around this would be gratefully received.

Cheers

Wiktor Stribiżew
Chris Barrett
  • It's most likely due to the width of what the positive lookbehind matches. `\s` and `_` each match 1 character, whereas `^` matches 0 characters. This makes the lookbehind non-fixed width and causes the error. You should instead use `(?:^|(?<=\s|_))`. The same error can be produced with `(?<=\s*)`, where the quantifier makes the lookbehind non-fixed width (variable-length quantifiers aren't permitted in lookbehinds) – ctwheels Oct 02 '17 at 16:33
  • Thanks. Just realised I double posted this, but I like your solution. – Chris Barrett Oct 02 '17 at 16:40
  • You're very welcome. I've converted my comment to an answer to provide future viewers an easy way of viewing the answer. – ctwheels Oct 02 '17 at 16:45

2 Answers


Brief

The pattern you presented, (?<=^|\s|_), is a lookbehind with 3 alternatives:

  1. ^ Assert position at start of the line
  2. \s Match any whitespace character
  3. _ Match the underscore character literally

Note that alternatives 2 and 3 each match exactly one character, while alternative 1 matches zero characters (it is a position assertion).

Since 1. is of width 0, and 2. and 3. are of width 1, this causes the lookbehind to be of variable width. Some regex flavours will permit subtleties such as assertions to be used alongside fixed width matches, while others will not.

Typically, in lookbehinds, any quantifier or alternation whose possible matches differ in length (variable length) causes an error like the one you've seen.
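
To make the fixed-width rule concrete, here is a minimal sketch that uses Python's re module as a stand-in for Boost. This is an assumption made for illustration: Python enforces a similar fixed-width restriction on lookbehinds, but Boost's exact behaviour and error text may differ. The helper name `compiles` is hypothetical.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Each lookbehind is checked on its own; only fixed-width ones should compile.
candidates = [
    r"(?<=\s|_)",    # both alternatives are 1 character wide
    r"(?<=^)",       # a single zero-width assertion
    r"(?<=^|\s|_)",  # mixes width 0 with width 1
    r"(?<=\s*)",     # quantifier makes the width variable
]
results = {p: compiles(p) for p in candidates}

for pattern, ok in results.items():
    print(f"{pattern!r}: {'accepted' if ok else 'rejected'}")
```

In Python, the first two patterns are accepted and the last two are rejected, which mirrors what the question reports for Boost.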

Solution

Some regex flavours will permit your code to run, while others will not. For regex flavours that do not permit this sort of behaviour, workarounds should be used.

For your specific case, you can likely use the following regex to solve your issue

(?:^|(?<=\s|_))
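
As a rough check of this workaround, the sketch below runs it in Python's re module (an assumption: Python is used here only because it has a similar fixed-width lookbehind restriction; it is not Boost). The token `foo` and the helper `starts` are hypothetical, chosen just to have something to match after the assertion.

```python
import re

# The anchor is moved outside the lookbehind, so the lookbehind itself
# only contains alternatives that are exactly one character wide.
WORKAROUND = re.compile(r"(?:^|(?<=\s|_))foo")

def starts(text: str) -> list:
    """Return the start offsets of every 'foo' accepted by the pattern."""
    return [m.start() for m in WORKAROUND.finditer(text)]

# 'foo' at offset 0 (start of string), 6 (after '_') and 15 (after a space)
# are matched; the one inside 'xfoo' is not.
print(starts("foo a_foo xfoo foo"))
```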
ctwheels

Boost regex, like Python re, does not allow alternatives of different lengths in a lookbehind (^ matches zero chars, while \s and _ each match 1 char). See the Boost reference:

(?<=pattern) consumes zero characters, only if pattern could be matched against the characters preceding the current position (pattern must be of fixed length).

In these cases, it is a good idea to use a negative lookbehind with a negated character class matching any char but the ones you need. The (?<=^|\s|_) pattern will change into

(?<![^\s_])

It will match any location that is not immediately preceded by a char other than whitespace or _ (i.e. it will match at the start of the string (^), or after a whitespace or _, which is just what you need).
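
A small sketch, again using Python's re module as a stand-in (an assumption; Boost is not exercised here), compares the positions matched by the negative lookbehind with those matched by the anchored alternation `(?:^|(?<=\s|_))`. The helper `positions` and the sample strings are hypothetical.

```python
import re

ALTERNATION = re.compile(r"(?:^|(?<=\s|_))")  # anchor outside the lookbehind
NEGATED = re.compile(r"(?<![^\s_])")          # negative lookbehind, negated class

def positions(rx, text: str) -> list:
    """Return every offset at which the zero-width pattern matches."""
    return [m.start() for m in rx.finditer(text)]

samples = ["a b_c", "_x y", "no-match-inside", " lead"]
for s in samples:
    print(repr(s), positions(ALTERNATION, s), positions(NEGATED, s))
```

On these samples both patterns report the same offsets, for example 0, 2 and 4 for `"a b_c"`.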

See the regex demo:


Wiktor Stribiżew