I would like to build a regular expression that matches runs of a repeated single character that directly follow each other: for example, the same character 'A' three times, followed by another character 'B' two times. It doesn't matter if the second group's character is repeated more than two times. For instance, it should match the string wuzDDDFFFxji:

Full match  3-8 `DDDFF`
Group 1.    3-4 `D`
Group 2.    6-7 `F`

I've come up with the following regular expression but there's one limitation.

(.)\1{2}(.)\2{1}

It almost works but it will not exclude the first group's character from being matched in the second group. The string qwuiuQQQQQsas will be matched since:

Full match  5-10    `QQQQQ`
Group 1.    5-6 `Q`
Group 2.    8-9 `Q`

This isn't what I want, but I couldn't find the correct syntax to exclude the character matched by one group from being matched by another. My closest attempt doesn't seem to work:

(.)\1{2}((?:\1))\2{1}


1st Capturing Group (.)
. matches any character (except for line terminators)
\1{2} matches the same text as most recently matched by the 1st capturing group
{2} Quantifier — Matches exactly 2 times
2nd Capturing Group ((?:\1))
Non-capturing group (?:\1)
\1 matches the same text as most recently matched by the 1st capturing group
\2{1} matches the same text as most recently matched by the 2nd capturing group
{1} Quantifier — Matches exactly one time (meaningless quantifier)

Any hint here? Thank you so much!

tbop

2 Answers

4

To avoid matching qwuiuQQQQQsas you need to use a negative lookahead rather than a non-capturing group:

(.)\1{2}((?!\1).)\2
         ^^^^^^

See the regex demo.

The (?!\1) negative lookahead will "restrict" the . pattern to only match characters other than those matched into Group 1.

Non-capturing groups do not restrict any pattern; they only group subpatterns, which still consume text. Lookaheads are zero-width assertions: they do not consume text and only check whether text matching their pattern is present at the current position.
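
As a quick check of the difference, here is a minimal sketch in Python. The question doesn't name a regex flavor, so using Python's `re` module is an assumption, and the variable names `naive` and `fixed` are only illustrative.

```python
import re

# First attempt from the question: nothing stops group 2 from
# matching the same character as group 1.
naive = re.compile(r"(.)\1{2}(.)\2")

# Pattern from this answer: (?!\1) fails if the next character
# equals what group 1 captured, so group 2 must be a different character.
fixed = re.compile(r"(.)\1{2}((?!\1).)\2")

for text in ("wuzDDDFFFxji", "qwuiuQQQQQsas"):
    for name, pattern in (("naive", naive), ("fixed", fixed)):
        m = pattern.search(text)
        print(text, name, m.group(0) if m else None)
```

With these two sample strings, both patterns find `DDDFF` in the first one, while only the naive pattern finds a match (`QQQQQ`) in the second.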

Wiktor Stribiżew
  • Seems to work! I hadn't got this far. By the way, there's something that puzzles me: given that I want n times the same character, why do I have to write {n-1} after the reference to the group? It seems the capture already counts as one occurrence itself. – tbop Jan 09 '17 at 11:24
  • You match a char with `.` that is inside capturing parentheses `()`. So, adding `\1{2}` after it will match 2 more identical chars, 3 in total. Groups consume text; only lookarounds (lookbehinds, lookaheads) and other zero-width assertions (word boundaries, anchors) do not consume text. – Wiktor Stribiżew Jan 09 '17 at 11:25
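
The "n - 1" counting from the comments can be sketched as a small helper. This assumes Python's `re` module; the function name `repeated_pair` and its parameters are hypothetical and not part of the thread.

```python
import re


def repeated_pair(first: int, second: int) -> re.Pattern:
    """Build a pattern for `first` copies of one character followed by
    `second` copies of a different character (hypothetical helper)."""
    if first < 1 or second < 1:
        raise ValueError("both counts must be at least 1")
    # Each capturing group already consumes one character, so the
    # backreference only has to match count - 1 further copies.
    return re.compile(
        r"(.)\1{%d}((?!\1).)\2{%d}" % (first - 1, second - 1)
    )


pattern = repeated_pair(3, 2)
print(pattern.pattern)
print(pattern.search("wuzDDDFFFxji").group(0))
```

For three and two repetitions this builds `(.)\1{2}((?!\1).)\2{1}`, which is the accepted pattern with the redundant `{1}` written out.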
2

I would suggest using "\1 not followed by \1" pattern:

(.)\1+(?!\1)(.)\2+

Demo: https://regex101.com/r/QkqpzS/1
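
A minimal sketch of how this pattern behaves, assuming Python's `re` module (the thread does not name a flavor; `run_pair` is an illustrative name). Note that `\1+` and `\2+` accept runs of two or more characters, so this is looser than "exactly three, then at least two".

```python
import re

# (.)\1+    a run of at least two identical characters
# (?!\1)    the run must end here: the next char differs from group 1
# (.)\2+    a second run of at least two identical characters
run_pair = re.compile(r"(.)\1+(?!\1)(.)\2+")

for text in ("wuzDDDFFFxji", "qwuiuQQQQQsas", "zaabbz"):
    m = run_pair.search(text)
    print(text, m.group(0) if m else None)
```

On these samples the pattern matches the whole of both runs (`DDDFFF`), rejects `qwuiuQQQQQsas`, and also accepts the shorter `aabb`, which a pattern requiring three leading characters would not.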

Dmitry Egorov