5

I tried to use regexes for finding max-length sequences formed from repeated doubled letters, like AABB in the string xAAABBBBy.

As described in the official documentation:

The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible.

When I use the quantifier {n,}, I get a full substring, but + returns only parts:

import re

print(re.findall("((AA|BB){3,})", "xAAABBBBy"))
# [('AABBBB', 'BB')]
print(re.findall("((AA|BB)+)", "xAAABBBBy"))
# [('AA', 'AA'), ('BBBB', 'BB')]

Why is {n,} more greedy than +?

Lover of Structure
  • 1,561
  • 3
  • 11
  • 27
Anton Ganichev
  • 2,184
  • 1
  • 18
  • 17
  • 2
    because the leftmost match wins. The second pattern matches the two first "A", can't match the remaining A, and matches the Bs. The first pattern starts at the second A. – Casimir et Hippolyte Mar 24 '23 at 19:56
  • 3
    Besides `(AA|BB){3,}` is not really equivalent of `(AA|BB)+`. `(AA|BB){1,}` will behave same – anubhava Mar 24 '23 at 19:58
  • 1
    Both patterns are greedy and matches as much as possible chars. But the second pattern can match 1 or more times but not 2 times. As the quantifier is 1 or more times, the pattern has a match. But the "rule" for the first pattern is 3 or more times, so the pattern can not match `BB` after the first `AA` in `AAABB` and moves to the second A and tries again and then it can match at least 3 times either `AA` or `BB` – The fourth bird Mar 24 '23 at 20:08
  • @Anton Ganichev by the way, what was your expected output, the "max length sequence" in the string? – cards Mar 24 '23 at 20:51

1 Answers1

5

Both quantifiers {3,} and + are greedy in the same way.

First, let's simplify the output a little bit by changing the inner group into a non-capturing one:

import re

print(re.findall("((?:AA|BB){3,})", "xAAABBBBy"))
# ['AABBBB']
print(re.findall("((?:AA|BB)+)", "xAAABBBBy"))
# ['AA', 'BBBB']

The first pattern requires a repetition (with the total number of occurrences being at least 3 – let's call this multiplicity ≥3), so the only possible match starts at the second A:

x AA A B B B By

The second pattern requires only multiplicity ≥1. As the string is scanned left-to-right, the first (left-most) possible greedy match is formed by the first two As. For the remaining string ABBBBy, the first (left-most) possible greedy match is BBBB. After that, only y remains, which can't be matched.

xA AAB B B By

Lover of Structure
  • 1,561
  • 3
  • 11
  • 27