regex matching a repeated subpattern

Question

I want regex to match a pattern only when it consists of a repeated subpattern. It can be boiled down to something as simple as the following. Given this text:

a a
a b
b b

I want a regex pattern that will only match "a a" and "b b" (and NOT "a b"), because there are two occurrences of the same subpattern on those lines.

I'm working in BBEdit, though the solution presumably would apply to any extended regex. I've been reading a lot about conditional subpatterns here on Stack Overflow and elsewhere, and experimenting as I go, but I don't seem to be able to make it work. I'm probably going to be chagrined when I find out how simple it is. Bonus points (as usual) for explaining why the regex works the way it does.

hwnd · Accepted Answer · 2014-06-24T02:20:31.077

6

Well, from your example data, you want to use a backreference, like so:

(.) \1

Explanation:

Backreferences allow you to refer back to what was previously matched by a capturing group.

A backreference is specified in the regular expression as a backslash (\) followed by a number indicating the number of the capturing group to be recalled.

(         # group and capture to \1:
  .       #   any character except \n
)         # end of \1
          # ' ' (a literal space)
\1        # what was matched by capture \1
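
As a quick check outside BBEdit, here is a minimal sketch in Python, whose `re` module supports the same `\1` backreference syntax. The variable names are only for illustration, and using `search` (rather than anchoring to the whole line) is an assumption.

```python
import re

# The pattern from the answer: one captured character, a space,
# then whatever that first group captured.
pattern = re.compile(r"(.) \1")

# The three sample lines from the question.
lines = ["a a", "a b", "b b"]

# Keep only the lines where the backreference finds the same character again.
matched = [line for line in lines if pattern.search(line)]
print(matched)
```

Only the lines with the same character on both sides of the space are kept; "a b" is rejected because `\1` must repeat the captured "a".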


edited Jun 24 '14 at 02:20

answered Jun 24 '14 at 01:46

hwnd


This is a nice, elegantly simple solution. I hoped I was overcomplicating things, and I was. I was aware of backreferences, from using them to make substitutions, but kept trying to combine them with conditional patterns, which I now see are simply not needed here. Thank you! – larryy Jun 24 '14 at 03:33

Avinash Raj · Answer 2 · 2014-06-24T01:56:06.833

4

You could try this regex:

(?:(.) \1)

or

(.) \1


It captures the first character and checks it against the character after the space. This is done with a backreference.

Explanation: (?:(.) \1)

(?:...) A non-capturing group: it groups the pattern without creating a numbered capture.
(.) Captures the first character and stores it in group 1.
" " Matches a literal space.
\1 Matches the same text that group 1 captured. If the character after the space is the same, the whole pattern matches.

Explanation: (.) \1

Same without a non-capturing group.
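
To see that the two forms behave the same, here is a small sketch in Python (used only as a convenient engine with the same syntax; the names `wrapped` and `plain` are made up for the example). Since `(?:...)` does not create a numbered group, `\1` refers to `(.)` in both patterns.

```python
import re

wrapped = re.compile(r"(?:(.) \1)")  # with the surrounding non-capturing group
plain = re.compile(r"(.) \1")        # without it

# Both patterns contain exactly one capturing group.
print(wrapped.groups, plain.groups)

# Both accept and reject the same sample lines from the question.
samples = ["a a", "a b", "b b"]
wrapped_hits = [s for s in samples if wrapped.search(s)]
plain_hits = [s for s in samples if plain.search(s)]
print(wrapped_hits == plain_hits)
```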

edited Jun 24 '14 at 01:56

answered Jun 24 '14 at 01:45

Avinash Raj


Don't need the surrounding groups. `(.) \1` is sufficient. – nneonneo Jun 24 '14 at 01:47
Thanks, this answers the question, with a good explanation. I accepted hwnd's post as the answer because it also did that, plus kept it simple. – larryy Jun 24 '14 at 03:38

score 1 · Answer 3 · answered Jun 24 '14 at 01:50

I'm not sure of the syntax in BBEdit, but will something like this work?

/(.+) \1/

This assumes the space in between is intentional. It tells the regex to capture a group of one or more characters and then match the same text again after a space.
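
A rough sketch of the `+` version in Python, assuming the whole line should consist of the repeated text (hence `fullmatch`; that choice and the sample strings are mine, not from the question). The surrounding slashes are just delimiters and are dropped here.

```python
import re

# One or more characters, a space, then the same characters again.
pattern = re.compile(r"(.+) \1")

# Hypothetical sample lines, including multi-character repeats.
samples = ["foo foo", "foo bar", "a a", "a b"]

# fullmatch requires the entire string to be "X X" for some non-empty X.
whole_line = [s for s in samples if pattern.fullmatch(s)]
print(whole_line)
```

With `search` instead of `fullmatch`, the pattern can also match inside a longer line, for example the "b b" in "ab b", so anchoring matters if only whole-line repeats should count.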

If BBEdit doesn't support backreferences to capture groups, you can't do what you're asking, since the set of strings of this form (some text followed by a copy of itself) is not a regular language. http://en.m.wikipedia.org/wiki/Regular_language

Regular expression engines that allow backreferences go beyond what a finite state automaton can do (a non-deterministic finite automaton recognizes exactly the regular languages). They typically rely on backtracking, which lets them match a superset of the regular languages.

Another good answer, and technically this generalizes a bit better due to the "+", but hwnd's answer was sufficient for my needs. Thanks. – larryy Jun 24 '14 at 03:40
