
I'm trying to come up with an example where a positive lookaround works but a non-capturing group doesn't, to better understand how each is used. The examples I'm coming up with all work with non-capturing groups as well, so I feel like I'm not fully grasping the use of positive lookaround.

Here is a string (taken from an SO example) whose answer uses a positive lookahead. The user wanted to grab the second column value, but only if the value in the first column started with ABC and the last column had the value 'active'.

string ='''ABC1    1.1.1.1    20151118    active
          ABC2    2.2.2.2    20151118    inactive
          xxx     x.x.x.x    xxxxxxxx    active'''

The solution given used a positive lookahead, but I noticed that I could use a non-capturing group to arrive at the same answer. So I'm having trouble coming up with an example where a positive lookaround works and a non-capturing group doesn't.

import re

pattern = re.compile(r'ABC\w\s+(\S+)\s+(?=\S+\s+active)')  # solution

pattern = re.compile(r'ABC\w\s+(\S+)\s+(?:\S+\s+active)')  # solution without lookaround

If anyone would be kind enough to provide an example, I would be grateful.

Thanks.

Moondra
  • It's going to be something with what comes after the lookahead. Lookaheads are zero width (I think) and non capturing isn't. – sniperd Aug 29 '17 at 17:20
  • A group (capturing or non-capturing) consumes the string. A lookaround does not. – cs95 Aug 29 '17 at 17:21

1 Answer


The fundamental difference is that non-capturing groups still consume the part of the string they match, moving the cursor forward, whereas lookarounds are zero-width: they check the text without consuming it.
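As a rough illustration of "consuming", here is a minimal sketch using the two patterns from the question on the first sample row. The variable names (`text`, `lookahead`, `non_capturing`, `m1`, `m2`) are only for this example. Both patterns capture the same group, but the overall match ends at a different place:

```python
import re

# First row of the sample data from the question
text = "ABC1    1.1.1.1    20151118    active"

lookahead = re.compile(r"ABC\w\s+(\S+)\s+(?=\S+\s+active)")
non_capturing = re.compile(r"ABC\w\s+(\S+)\s+(?:\S+\s+active)")

m1 = lookahead.search(text)
m2 = non_capturing.search(text)

# The captured group is identical in both cases
print(m1.group(1), m2.group(1))

# The overall match differs: the lookahead only peeks at the rest of the row,
# while the non-capturing group includes it in the match
print(repr(m1.group(0)))
print(repr(m2.group(0)))

# The cursor stops before the date column with the lookahead,
# and at the end of the row with the non-capturing group
print(m1.end(), m2.end(), len(text))
```

For a single search per row the difference is invisible in `group(1)`, which is why both versions appear to work. It only starts to matter when something else has to match the text that the group has already consumed.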

One case where this makes a real difference is matching strings that are surrounded by certain boundaries when those boundaries can overlap. Sample task:

Match every `a` in a given string that is surrounded by `b`s - the given string is `bababaca`. There should be two matches, at positions 2 and 4 (counting from 1).

Using lookarounds this is rather easy: you can use `b(a)(?=b)` or `(?<=b)a(?=b)` to match them. But `(?:b)a(?:b)` won't work - the first match also consumes the `b` at position 3, which is needed as the boundary for the second match. (Note: the non-capturing groups aren't actually needed here; `bab` behaves the same.)
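The overlap problem can be checked with a short sketch like the following. The variable names are made up for the example, and note that Python reports 0-based indices, so positions 2 and 4 in the text above show up as 1 and 3:

```python
import re

s = "bababaca"

# Lookbehind + lookahead: the surrounding b's are checked but not consumed
both_lookarounds = [m.start() for m in re.finditer(r"(?<=b)a(?=b)", s)]

# Consume the leading b, capture the a, only peek at the trailing b
lookahead_only = [m.start(1) for m in re.finditer(r"b(a)(?=b)", s)]

# Non-capturing groups consume both b's, so the second a loses its left boundary
consuming = [m.start() + 1 for m in re.finditer(r"(?:b)a(?:b)", s)]

print(both_lookarounds)  # [1, 3]
print(lookahead_only)    # [1, 3]
print(consuming)         # [1]
```

`re.finditer` resumes scanning where the previous match ended, which is why the consumed `b` is no longer available to the next match.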

Another rather prominent example is password validation - checking that the password contains uppercase letters, lowercase letters, digits, whatever. You can use a bunch of alternations to cover every possible ordering, but lookaheads make this far easier:

(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!?.])

vs

(?:.*[a-z].*[A-Z].*[0-9].*[!?.])|(?:.*[A-Z].*[a-z].*[0-9].*[!?.])|(?:.*[0-9].*[a-z].*[A-Z].*[!?.])|(?:.*[!?.].*[a-z].*[A-Z].*[0-9])|(?:.*[A-Z].*[a-z].*[!?.].*[0-9])|...
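A minimal sketch of how the lookahead version might be used in Python. On their own the lookaheads match an empty string, so this sketch assumes you anchor the pattern and add `\S+` to consume the whole password; the candidate strings and the `results` name are made up for the illustration:

```python
import re

# Every lookahead starts from the same position (the start of the string)
# and scans ahead independently, so the order of the character classes
# in the password does not matter. \S+ then consumes the actual text.
pattern = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!?.])\S+$")

candidates = [
    "AZN###3232!abbb32.....",  # has lowercase, uppercase, digit and one of !?.
    "abc123!",                 # no uppercase letter
    "Abc123",                  # none of ! ? .
]

results = {c: bool(pattern.match(c)) for c in candidates}
for candidate, ok in results.items():
    print(candidate, ok)
```

Because none of the lookaheads moves the cursor, all four conditions are tested against the same text, which is exactly what a chain of consuming groups cannot do without listing every ordering.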
Sebastian Proske
  • This was really helpful. Thank you so much! I'm having a little trouble understanding the second example. `(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=[!?.])` enforces that the password contains at least one lowercase letter, one uppercase letter, one digit, and so on? Since lookarounds don't consume anything, we end up with an empty match, right? I tried adding a `\S+` to the end of the pattern to see whether this string `string = 'AZN###3232!abbb32.....'` would be captured, but I'm ending up with an empty match. I assumed my entire string would be captured. – Moondra Aug 29 '17 at 18:01
  • Well, I forgot a `.*` in the last lookahead. Yes, you might want to use it with anchors and a `.+` or `\S+` to match the whole string. – Sebastian Proske Aug 29 '17 at 18:03
  • Nice example! ... except that we shouldn't be checking password formats anymore: NIST has [deprecated this practice](https://pages.nist.gov/800-63-3/). – NH. Sep 11 '17 at 20:23