This question comes from the book "Automate the Boring Stuff with Python".

import re

atRegex1 = re.compile(r'\w{1,2}at')
atRegex2 = re.compile(r'\w{1,2}?at')

atRegex1.findall('The cat in the hat sat on the flat mat.')
atRegex2.findall('The cat in the hat sat on the flat mat.')

I thought the question mark ? makes the match non-greedy, so \w{1,2}? should match only 1 character. But both of these calls give me the same output:

['cat', 'hat', 'sat', 'flat', 'mat']

In the book, however, the non-greedy version does return the shorter match:

nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
mo2.group()
'HaHaHa'

Can anyone help me understand why there is a difference? Thanks!

dyxdyx
  • Lazy quantifiers only work going forward in the string. It will try every attempt at a given location and will *backtrack* **if needed**. It matches `flat` because at the position of the `f` in `flat`, it still matches your regex: It doesn't forward track, it **back**tracks. It won't match `lat` from `flat` because it's already consumed those characters. – ctwheels May 07 '18 at 20:04
  • Since everything is explained below, I can only suggest to use `\b\wat\b` or a more precise [`\b[^\W\d_]at\b`](https://ideone.com/IhHMeB) pattern if you plan to match 3-letter whole words only. And just bear in mind that a lazily quantified pattern part at the end of the regex always matches as few symbols as it can. So, `as*?` will only match `a`, `as{1,200}?` will always match `as`. – Wiktor Stribiżew May 07 '18 at 20:36

2 Answers

The second regex (the one from the book) has a fixed pattern to match: Ha at least 3 and at most 5 times, but as few times as possible. Nothing follows the quantifier, so the engine is satisfied after 3 repetitions and never goes beyond that; here it behaves the same as (Ha){3}.

(Ha){3,5}? matches the same text as the alternation below (treat the groups as one):

(Ha){3}|(Ha){4}|(Ha){5}

and (Ha){3,5} matches the same as:

(Ha){5}|(Ha){4}|(Ha){3}

In both regexes, once the first alternative matches, the engine does not try the remaining ones.

What about \w{1,2}?at? Let's translate it:

(?:\w{1}|\w{2})at

The first alternative has priority: once it leads to a match, the matching process is done. The same is true of \w{1,2}at:

(?:\w{2}|\w{1})at

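Note: if the first alternative does not lead to a match, the engine tries the other alternatives in order. That is why flat is still matched by the lazy version: starting at f, \w{1} followed by at fails, so \w{2} is tried next and succeeds.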
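
As a rough check of the equivalences above, the following sketch runs each quantified pattern next to its alternation rewrite using Python's re module. The variable names are illustrative only, and the alternations use non-capturing (?:Ha) groups instead of (Ha); that substitution is an assumption of this sketch, made so that only the matched text is compared.

```python
import re

text = 'The cat in the hat sat on the flat mat.'

# Lazy and greedy quantifiers next to their alternation rewrites.
lazy = re.findall(r'\w{1,2}?at', text)
lazy_alt = re.findall(r'(?:\w{1}|\w{2})at', text)
greedy = re.findall(r'\w{1,2}at', text)
greedy_alt = re.findall(r'(?:\w{2}|\w{1})at', text)

# All four must still be followed by "at", so they agree with each other.
print(lazy, lazy_alt, greedy, greedy_alt, sep='\n')

laugh = 'HaHaHaHaHa'

# Nothing follows the quantifier here, so lazy and greedy differ.
ha_lazy = re.search(r'(Ha){3,5}?', laugh).group()
ha_lazy_alt = re.search(r'(?:Ha){3}|(?:Ha){4}|(?:Ha){5}', laugh).group()
ha_greedy = re.search(r'(Ha){3,5}', laugh).group()
ha_greedy_alt = re.search(r'(?:Ha){5}|(?:Ha){4}|(?:Ha){3}', laugh).group()

print(ha_lazy, ha_lazy_alt)      # HaHaHa HaHaHa
print(ha_greedy, ha_greedy_alt)  # HaHaHaHaHa HaHaHaHaHa
```

The \w{1,2} variants all print ['cat', 'hat', 'sat', 'flat', 'mat'], because the trailing at forces the engine past the first alternative whenever that alternative alone cannot complete the match.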

revo

The issue you're experiencing is due to the nature of backtracking in regex. The regex engine works through the string one position at a time, and at each position it attempts every option the pattern allows until it either matches or fails there. If it matches, it consumes those characters; if it fails, it moves on to the next position, and so on until the end of the string is reached.

The keyword here is backtracks. I think Microsoft's documentation does a great job of defining this term:

Backtracking occurs when a regular expression pattern contains optional quantifiers or alternation constructs, and the regular expression engine returns to a previous saved state to continue its search for a match. Backtracking is central to the power of regular expressions; it makes it possible for expressions to be powerful and flexible, and to match very complex patterns. At the same time, this power comes at a cost. Backtracking is often the single most important factor that affects the performance of the regular expression engine. Fortunately, the developer has control over the behavior of the regular expression engine and how it uses backtracking. This topic explains how backtracking works and how it can be controlled.

The regex engine backtracks to a previous saved state. It cannot forward track to a future saved state, although that would be pretty neat! Since you've specified that your match should end with at (the lazy quantifier precedes it), it will exhaust every regex option until \w{1,2} ending in at proves true.
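
To see this at a single position, here is a small sketch (assuming Python's re module; the names lazy, from_f and from_l are only illustrative). Pattern.match with a pos argument attempts a match starting exactly at that index, which makes it possible to compare what the lazy pattern finds when it starts at f with what it finds when it starts at l.

```python
import re

lazy = re.compile(r'\w{1,2}?at')
word = 'flat'

# Pattern.match(string, pos) only attempts a match starting at index pos.
from_f = lazy.match(word, 0)  # start at 'f'
from_l = lazy.match(word, 1)  # start at 'l'

# At 'f', one \w is tried first; "at" does not follow, so the lazy
# quantifier is extended to two characters and the match succeeds.
print(from_f.group())  # flat

# Starting at 'l', one \w is enough.
print(from_l.group())  # lat

# findall scans left to right, so the match starting at 'f' is found
# first and consumes the characters that 'lat' would have needed.
found = lazy.findall(word)
print(found)  # ['flat']
```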

How can you get around this? Well, the easiest way is probably to use a capture group:

\w*(\w{1,2}?at)
\w*(\w{1,2}at)    # yields same results as above (but in more steps)
\w*(\wat)         # yields same results as above (faster method)
\wat              # yields same results as above (fastest method)
\b\w{1,2}at\b     # perhaps this is what OP is after?
  • \w* Matches any word character any number of times. This is a fix to allow us to simulate forward tracking (this is not a proper term, just used in the context of the rest of my answer above). It will match as many characters as possible and work its way backwards until a match occurs.
  • The rest of the pattern is what the OP already had. In fact, \w{2} will never be used: because the \w* token is greedy, it gives back only as many characters as it has to, so \w{1,2} always ends up matching exactly one character. That means \w*(\wat), or simply \wat, can be used instead. Perhaps the OP intended to use anchors such as \b in the regex: \b\w{1,2}at\b? This does not depart from the intent of the OP's regex either, since making the quantifier lazy would in theory have produced the same results under forward tracking (a single \w would have satisfied \w{1,2}?, so \w{2} would never be reached).
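
The patterns listed above can be compared with a short sketch like the following, assuming Python's re module and the sentence from the question. The patterns list and results dictionary are illustrative names. Note that re.findall returns the contents of the capture group when a pattern has exactly one group.

```python
import re

text = 'The cat in the hat sat on the flat mat.'

patterns = [
    r'\w*(\w{1,2}?at)',
    r'\w*(\w{1,2}at)',
    r'\w*(\wat)',
    r'\wat',
    r'\b\w{1,2}at\b',
]

# With one capture group, findall returns only what the group captured.
results = {p: re.findall(p, text) for p in patterns}

for pattern, found in results.items():
    print(f'{pattern:<18} {found}')
```

The first four patterns give ['cat', 'hat', 'sat', 'lat', 'mat'], while the \b version keeps whole words and gives ['cat', 'hat', 'sat', 'flat', 'mat'].
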
ctwheels