Positive lookbehind vs non-capturing group: different behaviour

Question

I use Python regular expressions (the re module) in my code and noticed different behaviour in these cases:

re.findall(r'\s*(?:[a-z]\))?[^.)]+', 'a) xyz. b) abc.') # non-capturing group
# results in ['a) xyz', ' b) abc']

and

re.findall(r'\s*(?<=[a-z]\))?[^.)]+', 'a) xyz. b) abc.') # lookbehind
# results in ['a', ' xyz', ' b', ' abc']

What I need to get is just ['xyz', 'abc']. Why do the examples behave differently, and how do I get the desired result?

Anirudha · Accepted Answer · 2013-02-04T18:19:28.667

5

The reason a and b are included in the second case is that (?<=[a-z]\)) is optional and lookarounds don't consume any characters, so at the start of the string the regex is still positioned before a. Now [^.)]+ matches a.

Now you are at ). Since you have made (?<=[a-z]\)) optional, [^.)]+ matches xyz.

The same thing is repeated with b) abc.

Remove the ? from the second case and you would get the expected result, i.e. ['xyz', 'abc'].

edited Feb 04 '13 at 18:19

answered Feb 04 '13 at 17:53

Anirudha


The non-capturing group in the first case is optional, too (if no `a)` in text, then match the whole text). – aplavin Feb 04 '13 at 17:53
@chersanya That's why I said the second case, not the first case. There is a difference between them. – Anirudha Feb 04 '13 at 17:54
@chersanya Also, a lookaround checks for the specified pattern but doesn't eat any characters, hence the result. – Anirudha Feb 04 '13 at 17:56
Oh, I've got it) The real issue is that lookarounds don't consume anything, so findall finds `a` in `a)` too. – aplavin Feb 04 '13 at 18:00
Would you add the reason to your answer? – aplavin Feb 04 '13 at 18:01
@chersanya: "Doesn't consume anything" is not a good explanation. The text that is skipped can be considered consumed. The reason your original regex fails is plainly the `?`. – nhahtdh Feb 04 '13 at 18:03
@nhahtdh: Are you sure? Lookbehind doesn't consume text, so the occurrences of `a` in `a)` and `abc` in `abc` are non-overlapping. If it consumed text, there would be no difference from the first case I provided. – aplavin Feb 04 '13 at 18:09
@chersanya: It is correct that look-behind doesn't consume text. But since you make the look-behind optional, the regex is effectively `\s*[^.)]+`. Making a look-behind optional seems to be supported only in Python, and I don't know why they allow it; it doesn't make sense to do such a thing, though. – nhahtdh Feb 04 '13 at 18:12
@nhahtdh: but if it consumed the text, the regex with lookbehind would (my 2nd case) be equivalent to the first case, which obviously differs from `\s*[^.)]+`? Or no (why)? – aplavin Feb 04 '13 at 18:15
@nhahtdh Lookarounds can be optional; it's allowed in `.net` too. But I agree that it really doesn't make sense. – Anirudha Feb 04 '13 at 18:21
@chersanya: What I meant is that, due to `?`, the regex is **made equivalent** to `\s*[^.)]+`, since the result of the look-behind (whether true or false) doesn't stop the match. – nhahtdh Feb 04 '13 at 18:27
@chersanya: The "doesn't consume anything" argument may come into play in some other case, but not in this one. – nhahtdh Feb 04 '13 at 18:40
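
The equivalence nhahtdh describes in the comments can be checked with a short sketch. It assumes, as the comments argue, that an optional lookbehind never rejects a position, so the second pattern from the question behaves like `\s*[^.)]+`. The variable names are illustrative only. The spans show which characters each pattern consumes.

```python
import re

text = 'a) xyz. b) abc.'

# First pattern from the question: the optional non-capturing group
# consumes "a)" and "b)" when they are present.
grouped = r'\s*(?:[a-z]\))?[^.)]+'

# Assumption taken from the comments: with an optional lookbehind the
# second pattern reduces to this simpler one.
simplified = r'\s*[^.)]+'

simplified_matches = re.findall(simplified, text)
print(simplified_matches)  # ['a', ' xyz', ' b', ' abc']

# (start, end) of every match for each pattern.
grouped_spans = [m.span() for m in re.finditer(grouped, text)]
simplified_spans = [m.span() for m in re.finditer(simplified, text)]
print(grouped_spans)     # [(0, 6), (7, 14)]
print(simplified_spans)  # [(0, 1), (2, 6), (7, 9), (10, 14)]
```

The simplified pattern reproduces the output reported in the question for the lookbehind case: the `)` at indexes 1 and 9 is never part of a match, so `a` and `b` come out as separate matches.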

score 0 · Answer 2 · answered Aug 10 '17 at 13:49

The regex you are looking for is:

re.findall(r'(?<=[a-z]\) )[^) .]+', 'a) xyz. b) abc.')

I believe the currently accepted answer by Anirudha explains the difference between your use of the positive lookbehind and the non-capturing group well. However, the suggestion of removing the ? after the positive lookbehind actually results in [' xyz', ' abc'] (note the included spaces).

This is because the lookbehind in that pattern does not cover the space that follows the closing parenthesis, and the main character class [^.)] does not exclude spaces, so the space ends up in the match.
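
A minimal sketch comparing the two variants on the question's input. The variable names are illustrative and not part of either answer.

```python
import re

text = 'a) xyz. b) abc.'

# Mandatory lookbehind without the space: \s* backtracks to an empty
# match so the lookbehind succeeds right after ")", and [^.)]+ then
# takes the space along with the word.
no_space = re.findall(r'\s*(?<=[a-z]\))[^.)]+', text)
print(no_space)  # [' xyz', ' abc']

# Space moved into the lookbehind and excluded from the character class.
with_space = re.findall(r'(?<=[a-z]\) )[^) .]+', text)
print(with_space)  # ['xyz', 'abc']
```

Note that because `[^) .]+` excludes spaces, this pattern returns only the first word after each label; for multi-word items a different character class would be needed.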
