Why is the regex quantifier {n,} more greedy than + (in Python)?

Question

I tried to use regexes for finding max-length sequences formed from repeated doubled letters, like AABB in the string xAAABBBBy.

As described in the official documentation:

The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible.

When I use the quantifier {n,}, I get a full substring, but + returns only parts:

import re

print(re.findall("((AA|BB){3,})", "xAAABBBBy"))
# [('AABBBB', 'BB')]
print(re.findall("((AA|BB)+)", "xAAABBBBy"))
# [('AA', 'AA'), ('BBBB', 'BB')]

Why is {n,} more greedy than +?

Because the leftmost match wins. The second pattern matches the first two "A"s, can't match the remaining A, and then matches the Bs. The first pattern starts at the second A. — Casimir et Hippolyte, Mar 24 '23 at 19:56
Besides, `(AA|BB){3,}` is not really equivalent to `(AA|BB)+`. `(AA|BB){1,}` will behave the same as `(AA|BB)+`. — anubhava, Mar 24 '23 at 19:58
Both patterns are greedy and match as many characters as possible. The second pattern only needs 1 or more repetitions, so it has a match as soon as it finds the first `AA`, even though it cannot continue to a second repetition there. The "rule" for the first pattern is 3 or more repetitions, so it cannot match `BB` after the first `AA` in `AAABB`; it moves to the second A and tries again, and from there it can match `AA` or `BB` at least 3 times. — The fourth bird, Mar 24 '23 at 20:08
@Anton Ganichev by the way, what was your expected output, the "max length sequence" in the string? — cards, Mar 24 '23 at 20:51

Lover of Structure · Answer 1 · 2023-05-28T10:03:21.947

Both quantifiers {3,} and + are greedy in the same way.

First, let's simplify the output a little bit by changing the inner group into a non-capturing one:

import re

print(re.findall("((?:AA|BB){3,})", "xAAABBBBy"))
# ['AABBBB']
print(re.findall("((?:AA|BB)+)", "xAAABBBBy"))
# ['AA', 'BBBB']

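The first pattern requires at least 3 occurrences in a row (let's call this multiplicity ≥3). Starting at the first A, the engine can match AA once, but the next two characters are AB, so it cannot reach 3 occurrences there. The only possible match therefore starts at the second A: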

x A ⟨A A B B B B⟩ y
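
To check this claim, here is a small sketch that tries the first pattern at every start index separately. It relies on the `pos` argument of a compiled pattern's `match` method, which anchors the attempt at that index; the names `at_least_three`, `attempts`, and `matching_starts` are just illustrative.

```python
import re

text = "xAAABBBBy"
at_least_three = re.compile(r"(?:AA|BB){3,}")

# Pattern.match(string, pos) only tries to match starting exactly at index pos,
# so this shows which start positions can satisfy "at least 3 repetitions".
attempts = {pos: at_least_three.match(text, pos) for pos in range(len(text))}
matching_starts = [pos for pos, m in attempts.items() if m is not None]

print(matching_starts)          # [2]
print(attempts[2].group(0))     # AABBBB
```

Index 1 (the first A) fails because only one AA fits before the engine hits AB, and index 4 (the first B) fails because only two BB pairs are left.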

The second pattern requires only multiplicity ≥1. As the string is scanned left-to-right, the first (left-most) possible greedy match is formed by the first two As. For the remaining string ABBBBy, the first (left-most) possible greedy match is BBBB. After that, only y remains, which can't be matched.

x ⟨A A⟩ A ⟨B B B B⟩ y
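
The two diagrams can be reproduced by printing the match positions with `re.finditer`. This is only a sketch; `match_spans` is a helper name made up for this example. It also includes `{1,}`, which behaves the same as `+` on this input.

```python
import re

text = "xAAABBBBy"

def match_spans(pattern, string):
    """Return (start, end, matched_text) for each non-overlapping match, left to right."""
    return [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, string)]

at_least_three = match_spans(r"(?:AA|BB){3,}", text)
one_or_more = match_spans(r"(?:AA|BB)+", text)
one_or_more_braces = match_spans(r"(?:AA|BB){1,}", text)

print(at_least_three)                      # [(2, 8, 'AABBBB')]
print(one_or_more)                         # [(1, 3, 'AA'), (4, 8, 'BBBB')]
print(one_or_more == one_or_more_braces)   # True
```

Both quantifiers consume as much as they can from wherever the match starts; the only difference is which start positions are allowed by the minimum count.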
