If I have the pattern ([a-z]){2,4} and the string "ab", what should I expect to see in backreference \1?

I'm getting "b", but why "b" rather than "a"?

I'm sure there is a valid explanation, but reading around various sites explaining regexes, I haven't found one. Anybody?

Monkeybrain
  • You usually want the parentheses around the repeat factor too (`([a-z]{2,4})`). Otherwise, you get whatever you get (and that's about what you deserve). Why should it be 'a' rather than 'b'? It is ill-formed; not exactly wrong, but not exactly well written. – Jonathan Leffler Sep 23 '11 at 17:20
  • Thanks Jonathan. So are you saying the result from this operation is undefined, and the implementation is free to give me whatever it likes? – Monkeybrain Sep 23 '11 at 18:55
  • You can find the explanation here: http://www.regular-expressions.info/brackets.html#repeat . While you're at it, read the rest of the page, and then the rest of the site `:)` – Kobi Sep 23 '11 at 19:18
  • Sorta: see the nice link from Kobi. I was saying that I don't really remember whether it is defined and deterministic, but the link says "yes, it is defined and deterministic - and the last character is what is captured". The explanation makes sense. – Jonathan Leffler Sep 23 '11 at 20:15
  • Kobi - that explanation is perfect, exactly what I was looking for. Many thanks. – Monkeybrain Sep 23 '11 at 23:01

1 Answer

I'm not sure why nobody put this as an answer, but just for anyone hitting this page with a similar question, the answer is essentially that this regex:

([a-z]){2,4}

will match a single character from a to z at least 2 and at most 4 times in a row. Each repetition matches one character and stores it in the capture group (that is, whatever is between the () characters in the expression), overwriting what the previous repetition stored. The backreference therefore ends up holding only the last character matched.

A similar expression (suggested in the comments on the question):

([a-z]{2,4})

moves the back-reference to surround the entire match (2-4 characters a-z) instead of a single character.

The parentheses define a capture group, which is what a back-reference refers to. When the repetition is inside the group (the second example), the group captures all the characters that make up that repetition. When the repetition is outside the group (the first example), the group captures one letter, then the process repeats, capturing the next letter into the same group and overwriting the previous one. With {2,4}, this happens at least twice and at most four times, so only the last letter matched remains in the back-reference.
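To see the difference side by side, here is a minimal sketch using Python's standard `re` module (Python is my choice for illustration; the question does not name a language, and the variable names are made up):

```python
import re

target = "abcd"

# Repetition outside the group: the group is re-captured on every
# iteration, so only the last letter survives.
outside = re.match(r"([a-z]){2,4}", target)

# Repetition inside the group: the group spans the whole repeated run.
inside = re.match(r"([a-z]{2,4})", target)

print(outside.group(1))  # d
print(inside.group(1))   # abcd
```

Other regex flavors should behave the same way for this pattern, but I have only checked the Python behavior here.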

So, matching against the target abc results in \1 equaling c, and matching against abcd results in \1 equaling d. With more letters, the outcome depends on the function (and language) used to run the regular expression: the target abcde might fail to match (if the function requires the whole string to match), or might result in \1 equaling d (because the e is not part of the match).
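As a rough illustration of how the choice of function changes the outcome, here is a sketch in Python, where `match` only needs a match at the start of the string and `fullmatch` requires the whole string to match. The helper `last_capture` is a hypothetical name introduced just for this example:

```python
import re

pattern = re.compile(r"([a-z]){2,4}")

def last_capture(target, whole_string=False):
    """Return group 1 after matching, or None if there is no match."""
    if whole_string:
        m = pattern.fullmatch(target)  # entire target must match
    else:
        m = pattern.match(target)      # match only needs to start at index 0
    return m.group(1) if m else None

print(last_capture("abc"))                       # c
print(last_capture("abcd"))                      # d
print(last_capture("abcde"))                     # d (the e is left unmatched)
print(last_capture("abcde", whole_string=True))  # None (5 letters exceed {2,4})
```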

The first example expression can still give you abc or abcd if you use the whole-match reference (often $& or $0, sometimes \& or \0, and in Tcl just an & character), which returns the entire string matched by the entire regular expression.
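In Python, for instance, the whole match is `group(0)` on the match object and `\g<0>` inside a replacement string. A small sketch, assuming the same first pattern (the square brackets in the replacement are only there to make the matched part visible):

```python
import re

m = re.match(r"([a-z]){2,4}", "abcde")

whole = m.group(0)  # everything the full expression matched
last = m.group(1)   # only the final iteration of the group

# \g<0> is the whole-match reference in a Python replacement string.
wrapped = re.sub(r"([a-z]){2,4}", r"[\g<0>]", "abcde")

print(whole)    # abcd
print(last)     # d
print(wrapped)  # [abcd]e
```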

Code Jockey