5

I use Python regular expressions (the re module) in my code and noticed different behaviour in these two cases:

re.findall(r'\s*(?:[a-z]\))?[^.)]+', 'a) xyz. b) abc.') # non-capturing group
# results in ['a) xyz', ' b) abc']

and

re.findall(r'\s*(?<=[a-z]\))?[^.)]+', 'a) xyz. b) abc.') # lookbehind
# results in ['a', ' xyz', ' b', ' abc']

What I need to get is just ['xyz', 'abc']. Why do the two examples behave differently, and how do I get the desired result?

aplavin

2 Answers

5

The reason a and b are included in the second case is that (?<=[a-z]\)) only checks for a) and, since lookarounds don't consume any characters, you are still at the start of the string. Now [^.)]+ matches a.

Now you are at ). Since you have made (?<=[a-z]\)) optional, [^.)]+ then matches xyz.

The same thing is repeated with b) abc.
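To make the walk-through above concrete, here is a minimal sketch that prints where each match starts and ends. It rests on one assumption, labeled in the code: because the lookbehind is optional it can never block a match, so the simplified pattern `\s*[^.)]+` is used as a stand-in for the second case. The names `text`, `pattern` and `matches` are only illustrative.

```python
import re

text = 'a) xyz. b) abc.'

# Assumption: an optional lookbehind cannot veto a match, so this simplified
# pattern stands in for r'\s*(?<=[a-z]\))?[^.)]+' from the question.
pattern = re.compile(r'\s*[^.)]+')

# Collect (start, end, matched text) for every non-overlapping match.
matches = [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]

for start, end, piece in matches:
    print(start, end, repr(piece))
```

The printed spans show that the `)` characters are never part of a match: each match stops in front of a `)`, and the next one starts right after it.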

Remove the ? from the second case and you would get the expected result, i.e. ['xyz', 'abc'].

Anirudha
  • The non-capturing group in the first case is optional, too (if no `a)` in text, then match the whole text). – aplavin Feb 04 '13 at 17:53
  • @chersanya that's why i had said second case not first case..there is difference between them – Anirudha Feb 04 '13 at 17:54
  • 1
    @chersanya also, lookarounds check for the specified pattern but don't eat any characters, hence the result. – Anirudha Feb 04 '13 at 17:56
  • Oh, I've got it) The real issue is that lookarounds don't consume anything, so findall finds `a` in `a)` too. – aplavin Feb 04 '13 at 18:00
  • Would you add the reason to your answer? – aplavin Feb 04 '13 at 18:01
  • @chersanya: "Does not consume anything" is not a good explanation. The text that is skipped can be considered consumed. The reason your original regex fails is plainly the `?`. – nhahtdh Feb 04 '13 at 18:03
  • @nhahtdh: are you sure? Lookbehind doesn't consume text, so the occurrences of `a` in `a)` and `abc` in `abc` are non-overlapping. If it consumed text, there would be no difference from the first case I provided. – aplavin Feb 04 '13 at 18:09
  • @chersanya: It is correct that look-behind doesn't consume text. But since you make the look-behind optional, the regex is effectively `\s*[^.)]+`. Making a look-behind optional seems to be supported only in Python and I don't know why they allow it; it doesn't make sense to do such a thing, though. – nhahtdh Feb 04 '13 at 18:12
  • @nhahtdh: but if it consumed the text, the regex with lookbehind (my 2nd case) would be equivalent to the first case, which obviously differs from `\s*[^.)]+`? Or not (why)? – aplavin Feb 04 '13 at 18:15
  • @nhahtdh lookarounds can be optional..it's allowed in `.net`..but i agree that it really doesn't make sense – Anirudha Feb 04 '13 at 18:21
  • @chersanya: What I meant is that, due to `?`, the regex is **made equivalent** to `\s*[^.)]+`, since the result of the look-behind (whether true or false) doesn't stop the match. – nhahtdh Feb 04 '13 at 18:27
  • @chersanya: The "not consume anything" argument may come into play in some other case, but not this one. – nhahtdh Feb 04 '13 at 18:40
0

The regex you are looking for is:

re.findall(r'(?<=[a-z]\) )[^) .]+', 'a) xyz. b) abc.')

I believe the currently accepted answer by Anirudha explains the difference between your use of a positive lookbehind and a non-capturing group well. However, the suggestion of removing the ? after the positive lookbehind actually results in [' xyz', ' abc'] (note the included spaces).

This happens because the positive lookbehind does not cover the space after the ), and the negated character class [^.)] does not exclude spaces, so the space ends up in the match.
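As a quick check of the above, the sketch below runs the pattern with the `?` removed next to the pattern from this answer. The third variant, which consumes the `a)` label and returns only a capturing group, is not from the thread; it is an assumed alternative added for comparison. The variable names are illustrative.

```python
import re

text = 'a) xyz. b) abc.'

# Mandatory lookbehind: the space after ')' is still matched by [^.)]+.
with_spaces = re.findall(r'\s*(?<=[a-z]\))[^.)]+', text)

# Lookbehind that also covers the space, plus a class that excludes spaces.
lookbehind = re.findall(r'(?<=[a-z]\) )[^) .]+', text)

# Assumed alternative (not from the thread): consume the label and the
# whitespace, and let findall return only the capturing group.
captured = re.findall(r'[a-z]\)\s*([^.)]+)', text)

print(with_spaces)
print(lookbehind)
print(captured)
```

Note that Python's lookbehind must be fixed-width, so the lookbehind variant assumes exactly one space after the `)`. The capturing-group variant uses `\s*` and so tolerates any amount of whitespace there.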

TobalJackson