Capture the latest in backreference

Question

I have this regex

(\b(\S+\s+){1,10})\1.*MY

and I want group 1 to capture "The name" from

The name is is The name MY

I get "is" for now.

The name can be any random words of any length. It need not be at the beginning. It need not be only 2 or 3 words. It can be fewer than 10 words. The only sure thing is that it will be the last set of repeating words. Examples:

The name is Anthony is is The name is Anthony - "The name is Anthony".

India is my country All Indians are India is my country - "India is my country "

Times of India Alphabet Google is the company Alphabet Google canteen - "Alphabet Google"

What is 'The name'? Is it a random word/s? Will the length of the supposed 'name' always be less than 10 characters? — Robo Mop, Apr 07 '18 at 16:04
Please provide some more comprehensive sample inputs and outputs. — Robo Mop, Apr 07 '18 at 16:06
It's hard to see from a single example what you really need. Trivially `(The name)(\s\S)*\s\1\sMY` matches your example. If this is not acceptable, why not? How many other duplicated strings can there be in a sample? Can we rely on position, like `^((\S+\s)+)(\S\s)*\1.*MY` or `((\S+\s)+)(\S\s)+\1\sMY`? — tripleee, Apr 07 '18 at 16:06
The name can be any random words of length <= 10. The name need not be at the start of the sentence. The only sure thing is that it will be the last set of repeating words. The name is Anthony is is The name is Anthony - It should return "The name is Anthony". India is my country All Indians are India is my country . - "India is my country " Basically, the last repeating set of words. I don't want to hard code it. — user9474326, Apr 07 '18 at 16:07
Please [edit] your question with these clarifications. It's still not entirely clear how to generalize this, though. — tripleee, Apr 07 '18 at 16:11

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

You could try:

(\b\w+[\w\s]+\b)(?:.*?\b\1)

As demonstrated here

Explanation -

(\b\w+[\w\s]+\b) is the capture group 1 - which is the text that is repeated - separated by word boundaries.
(?:.*?\b\1) is a non-capturing group which tells the regex system to match the text in group 1, only if it is followed by zero-or-more characters, a word-boundary, and the repeated text.
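
To check the pattern against the samples from the question, a short Python script along these lines can help. This is only a sketch: the use of Python's re module and the helper name repeated_phrase are assumptions for illustration, not part of the original answer. Group 1 can come back with a trailing space, because the closing \b may land just before the next word, so the helper strips it.

```python
import re

# Pattern from this answer, copied unchanged.
PATTERN = re.compile(r"(\b\w+[\w\s]+\b)(?:.*?\b\1)")


def repeated_phrase(text):
    """Return group 1 with surrounding whitespace stripped, or None if no match.

    `repeated_phrase` is a hypothetical helper name used only for this sketch.
    """
    match = PATTERN.search(text)
    return match.group(1).strip() if match else None


# Sample inputs taken from the question.
samples = [
    "The name is is The name MY",
    "The name is Anthony is is The name is Anthony",
    "India is my country All Indians are India is my country",
    "Times of India Alphabet Google is the company Alphabet Google canteen",
]

for sample in samples:
    print(repr(repeated_phrase(sample)))
```

With these four inputs the stripped group 1 is "The name", "The name is Anthony", "India is my country" and "Alphabet Google". Since re.search scans from the left, the phrase reported is the first one that repeats, which happens to coincide with the wanted phrase in these samples.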

edited Jun 20 '20 at 09:12

Community

answered Apr 07 '18 at 16:25

Robo Mop

I definitely appreciate constructive criticism, but please post a reason for the downvote. – Robo Mop Apr 07 '18 at 16:27
Thanks. This helped – user9474326 Apr 07 '18 at 16:32
@user9474326 Anytime! Also, consider upvoting the post if you feel it has really helped :) – Robo Mop Apr 07 '18 at 16:40

tripleee · Answer 2 · 2018-04-07T16:34:49.337

Regex generally captures the longest leftmost match. There are no examples in your question where this would not actually be the string you want, but that could just mean you have not found good examples to show us.

With that out of the way,

((\S+\s)+)(\S+\s){0,9}\1

would appear to match your requirements as currently stated. The "longest leftmost" behavior could still get in the way if there are e.g. straddling repetitions, like

this that more words this that more words

where in the general case regex alone cannot easily be made to always prefer the last possible match and tolerate arbitrary amounts of text after it.
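
As a rough check of the pattern above, here is a small Python sketch. The re module and the helper name repeated_words are assumptions for illustration, not something this answer specifies. One thing it makes visible: every word in group 1 must be followed by whitespace, so a repetition that ends exactly at the end of the input comes back one word short unless the input has trailing whitespace.

```python
import re

# Pattern from this answer, copied unchanged.
PATTERN = re.compile(r"((\S+\s)+)(\S+\s){0,9}\1")


def repeated_words(text):
    """Return group 1 as matched (trailing whitespace included), or None.

    `repeated_words` is a hypothetical helper name used only for this sketch.
    """
    match = PATTERN.search(text)
    return match.group(1) if match else None


# Repeats followed by more text: the whole phrase is captured.
print(repr(repeated_words("The name is is The name MY")))
# 'The name '
print(repr(repeated_words(
    "Times of India Alphabet Google is the company Alphabet Google canteen")))
# 'Alphabet Google '

# Repeat ends at the end of the input: the last word has no trailing
# whitespace, so it cannot be part of group 1.
print(repr(repeated_words("The name is Anthony is is The name is Anthony")))
# 'The name is '

# Same input with one trailing space appended: the full phrase is captured.
print(repr(repeated_words("The name is Anthony is is The name is Anthony ")))
# 'The name is Anthony '
```

If the input may end right after the repeated phrase, appending a single space before matching (as in the last call) is one possible workaround; whether that is acceptable depends on the data.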

Thanks for the efforts. The answer from Coffeehouse Coder helped. — user9474326, Apr 07 '18 at 16:35
