6

Regexp: (?=(\d+))\w+\1 String: 456x56

Hi,

I don't understand how this regex matches "56x56" in the string "456x56".

  1. The lookahead, (?=(\d+)), captures 456 and stores it in \1 via (\d+)
  2. The word-character pattern, \w+, matches the whole string ("456x56")
  3. \1, which is 456, should then follow whatever \w+ matched
  4. After backtracking through the string, it should not find a match, as there is no "456" preceded by a word character

However the regexp matches 56x56.
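The behaviour described in the question can be reproduced with a few lines of Python. This is only a sketch using Python's standard `re` module (the question does not say which regex flavor was used, so that choice is an assumption); the variable names are illustrative.

```python
import re

# The pattern and input string from the question.
pattern = re.compile(r"(?=(\d+))\w+\1")
match = pattern.search("456x56")

print(match.group(0))  # overall match: 56x56
print(match.group(1))  # contents of \1: 56
print(match.start())   # the match starts at index 1, not 0
```

The match starting at index 1 rather than 0 is the detail the answers below explain.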

tchrist
Suresh

5 Answers

7

5) The regex engine concludes that it cannot find a match when it starts searching from the 4, so it skips one character and searches again. This time it captures two digits into \1 and ends up matching 56x56.

If you want to match only whole strings, use ^(?=(\d+))\w+\1$

^ matches beginning of string
$ matches end of string
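As a quick check of the anchored version, here is a minimal sketch in Python's `re` module (an assumed flavor; the variable names are made up for illustration). With both anchors in place the engine may only try position 0, so "456x56" is rejected while "56x56" still matches as a whole string.

```python
import re

# Anchored variant suggested above: the match must span the entire string.
anchored = re.compile(r"^(?=(\d+))\w+\1$")

whole_456 = anchored.search("456x56")  # \1 would have to be 456: no match
whole_56 = anchored.search("56x56")    # \1 is 56 and the string ends in 56

print(whole_456)          # None
print(whole_56.group(0))  # 56x56
```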
Amarghosh
  • Thanks, I get it :). We know that there is no match when the lookahead's first group = '456'. So instead of \w+ starting over from the second position of the string, why doesn't the lookahead, being the first operation before \w+, adjust its group to '56', letting \w+ match the whole string "456x56"? Isn't the lookahead supposed to try its next group before \w+ does, since the lookahead comes first? Shouldn't there then be two matches, 456x56 and 56x56: the first from the lookahead retrying, the second from \w+ retrying? – Suresh Jan 07 '12 at 17:14
  • @Maneesh Check this page for a better understanding of the regex engine's internals with respect to lookaround: http://www.regular-expressions.info/lookaround.html By the way, have you tested this regex with strings like "44" and even "445453"? – Amarghosh Jan 07 '12 at 17:21
  • @Maneesh See my answer below. I tried to detail what's happening. Or fge's answer which is particularly clear – Ludovic Kuty Jan 07 '12 at 19:02
  • @Ikuty. Yeah, thanks. fge's answer was very clear. I'm not sure how we can generate the kind of analysis that fge has posted below. – Suresh Jan 08 '12 at 06:52
  • @Maneesh You should probably accept that answer instead of mine :) – Amarghosh Jan 08 '12 at 09:28
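The test strings suggested in the comments can be tried directly. The sketch below assumes Python's `re` module, where a lookahead is not re-entered once it has succeeded, so at each starting position group 1 always holds the full run of digits found there. Under that assumption neither all-digit string can match, because \w+ has to consume at least one character and there is then no room left for a copy of the captured digits.

```python
import re

pattern = re.compile(r"(?=(\d+))\w+\1")

# At every start position the capture is the whole remaining digit run,
# so \w+ followed by \1 would need more characters than the string has.
result_44 = pattern.search("44")
result_445453 = pattern.search("445453")

print(result_44)      # None
print(result_445453)  # None
```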
6

You don't anchor your regex, as has been said. Another problem is that \w also matches digits... Now look at how the regex engine proceeds to match with your input:

# begin
regex: |(?=(\d+))\w+\1
input: |456x56
# lookahead (first group = '456')
regex: (?=(\d+))|\w+\1
input: |456x56 
# \w+
regex: (?=(\d+))\w+|\1
input: 456x56|
# \1 cannot be satisfied: backtrack on \w+
regex: (?=(\d+))\w+|\1
input: 456x5|6 
# And again, and again... Until the beginning of the input: \1 cannot match
# Regex engine therefore decides to start from the next character:
regex: |(?=(\d+))\w+\1
input: 4|56x56
# lookahead (first group = '56')
regex: (?=(\d+))|\w+\1
input: 4|56x56
# \w+
regex: (?=(\d+))\w+|\1
input: 456x56|
# \1 cannot be satisfied: backtrack
regex: (?=(\d+))\w+|\1
input: 456x5|6
# \1 cannot be satisfied: backtrack
regex: (?=(\d+))\w+|\1
input: 456x|56
# \1 satisfied: match
regex: (?=(\d+))\w+\1|
input: 4<56x56>
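The outer part of this walkthrough (one attempt per starting position) can be approximated in Python. This is a rough sketch, not a real engine trace: it does not show the individual backtracking steps of \w+, only what the lookahead captures at each start position and whether the full pattern succeeds there. The `trace` helper and the constant names are hypothetical, introduced just for this illustration.

```python
import re

LOOKAHEAD = re.compile(r"(?=(\d+))")
FULL = re.compile(r"(?=(\d+))\w+\1")

def trace(text):
    """Try each start position in turn, as the engine does, until one succeeds.

    Returns a list of (start, captured_digits, matched_text) tuples.
    """
    steps = []
    for start in range(len(text)):
        # What would the lookahead capture if the attempt began here?
        la = LOOKAHEAD.match(text, start)
        captured = la.group(1) if la else None
        # Does the whole pattern succeed from this position?
        m = FULL.match(text, start)
        steps.append((start, captured, m.group(0) if m else None))
        if m:
            break
    return steps

for start, captured, matched in trace("456x56"):
    print(start, captured, matched)
# 0 456 None
# 1 56 56x56
```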
fge
0

The points you listed are almost entirely, but not quite, wrong!

 1) The group  (?=(\d+)) matches a sequence of one or more digits,
    not necessarily 456 
 2) \w matches word characters: letters, the underscore, and digits too
 3) \1 is a back reference to the match in the group

So the whole expression means: find a position where a sequence of digits begins, then match word characters from that position (the digits included, since the lookahead consumes nothing), followed by the same sequence of digits again. Hence the match 56x56.
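As another answer in this thread points out, \w also matches digits, which is why \w+ can run straight through "456x56". A small check, assuming Python's `re` module (the sample string is made up for illustration):

```python
import re

# \w covers letters, digits and the underscore, but not punctuation.
word_chars = re.findall(r"\w", "4x_-!")
print(word_chars)  # ['4', 'x', '_']

# So \w+ on its own can consume the entire input from the question.
whole = re.fullmatch(r"\w+", "456x56")
print(whole.group(0))  # 456x56
```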

Mithrandir
0

Well that's what makes it a positive lookahead

 (?=(\d+))\w+\1

You are correct that the first \d+ would capture 456, so \1 would also have to be 456; but in that case the expression cannot match the string.

The only characters common to both sides of the x are 56, so that is what the engine settles on to get a match.

Tincan
0

The + quantifier is greedy and backtracks as necessary. The lookahead (?=(\d+)) captures 456 at the first starting position, then 56 at the next position if that attempt fails, then 6 if that fails too. First attempt: the lookahead matches and group 1 contains 456. Then \w+, which is greedy, takes 456x56; there is nothing left, but we still have to match \1, i.e. 456. Thus: failure. Then \w+ backtracks one step at a time until it has nothing more to give up, and the attempt still fails.

The engine then advances one character in the string. At this new starting position the lookahead sees the substring 56; it matches and group 1 contains 56. \w+ matches until the end of the string and gets 56x56, and then we try to match \1, i.e. 56: failure. So \w+ backtracks until 56 is left in the string, at which point \1 matches and the whole regex succeeds.

You should try it with RegexBuddy's debug mode.

Ludovic Kuty