
I'm trying to remove duplicated or similar lines. Only the last match should stay unselected; every earlier duplicated or similar line should be selected.

This is the text I want to clean (the `l1:`-style prefixes are not part of the text; they only show which line I'm referring to):

l1:In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris
l2:
l3:In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information". The painter and critic Maurice Denis shared a sense of bewilderment about Cézanne's revoluti
l4:
l5:Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information".
l6:
l7:In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information".
l8:
l9:He overturned centuries of theories about how the eye works by depicting a world constantly in motion, affected by the passing of time and infused with the artist's own memories and emotions.
l10:
l11:In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information".

In this example, I want only the last occurrence, on line 11, to stay unselected. That is the last line containing this text:

In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information".

Lines 1, 3, 5 and 7 contain the same or similar text and should be matched (selected) by the regex. A line can contain any text up to the newline, and the regex should also catch other cases like this elsewhere in the file.

I'm using this regex, but it doesn't work properly: it only selects l1 and l7, when it should also select l3 and l5. Here is the example: https://regex101.com/r/gd0Z3V/1

(?sm)(^[^\r\n]*)[\r\n](?=.*^\1)
Carlos P.
  • There are no totally equal lines, they are different. – Poul Bak Oct 18 '22 at 02:21
  • Sorry, now this line is duplicated: `In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information".` There are partial strings repeated that match the last line. I want to identify the partial or duplicated lines that match the last line in some way, containing partially or exactly the same text. Lines 1, 3, 5, 7 have some text that the last line has. I don't know if my problem is clearer now? – Carlos P. Oct 18 '22 at 02:34
  • Please format your text better, there are not 11 lines! – Poul Bak Oct 18 '22 at 02:53
  • Done. Please ignore the line numbers; the text is what matters. – Carlos P. Oct 18 '22 at 03:03

1 Answer

The main problem here is that regex doesn't understand human logic. "It looks the same" does not exist in regex. So the first requirement is to translate human logic to regex logic.

We can do that by specifying how many characters we want to be exactly the same to consider it a match.

Here I choose 100 characters. (You can of course change that, but it works with your example text).

Now we can build a regex that matches the whole line if 100 characters in that line are repeated further down the text:

/^.*(.{100}).*$(?=[\s\S]+\1)/gm

Explanation:

^.* - match from start of line zero or more characters

(.{100}) - create group 1, matching 100 characters

.*$ - match the rest of the line

(?=[\s\S]+\1) - look ahead for one or more of ANY character (including newline) followed by the text matched in group 1.

The result is that the whole line is matched, if 100 characters are repeated further down.
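
As a rough illustration of the pattern above, here is a minimal JavaScript sketch. The helper name `buildRepeatRegex`, the sample text and the window size of 10 are assumptions made for this example only; the answer itself uses a fixed 100 characters.

```javascript
// Hypothetical helper (not from the answer): builds the same pattern
// for a configurable number of characters n instead of a fixed 100.
function buildRepeatRegex(n) {
  return new RegExp(`^.*(.{${n}}).*$(?=[\\s\\S]+\\1)`, "gm");
}

// Made-up sample text, short enough to check by hand.
const text = [
  "alpha beta gamma",
  "unrelated line",
  "xx alpha beta gamma delta",
  "alpha beta gamma delta",
].join("\n");

// Lines that contain a 10-character run repeated further down the text.
const matchedLines = text.match(buildRepeatRegex(10)) || [];
console.log(matchedLines);
// → [ 'alpha beta gamma', 'xx alpha beta gamma delta' ]
```

The last line is never matched, because the lookahead needs at least one character after the end of the line before it can find the repeat. The unrelated line is left alone too, since none of its 10-character runs appear later.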

I have created a test case for you here: JSRegExpBuilder (it uses JavaScript, but the pattern should work in most flavors).

Poul Bak
  • Thanks, it works! I noticed that after changing (.{100}) to, say, 10 characters, it takes the last characters of the line. Could this be changed to take the first characters of the line? – Carlos P. Oct 18 '22 at 05:16
  • Try changing the first `.*` to `.*?`, making the first part 'non-greedy'. – Poul Bak Oct 18 '22 at 05:20
  • Or if you want to match from start of line, simply remove the first `.*`. Do what you find works best in your case. – Poul Bak Oct 18 '22 at 05:21
  • Wow this is all I need thank you so much @poul-bak – Carlos P. Oct 18 '22 at 05:26
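
The three variants discussed in these comments can be compared side by side. This is only a sketch: the window of 5 characters, the sample strings and the `capture` helper are assumptions chosen to keep the example small, not part of the original answer.

```javascript
// Greedy (the answer's form), lazy (first `.*` changed to `.*?`),
// and anchored (first `.*` removed), each with a 5-character window.
const greedy = /^.*(.{5}).*$(?=[\s\S]+\1)/m;
const lazy = /^.*?(.{5}).*$(?=[\s\S]+\1)/m;
const anchored = /^(.{5}).*$(?=[\s\S]+\1)/m;

// Hypothetical helper: returns what group 1 captured, or null if no match.
const capture = (re, s) => {
  const m = re.exec(s);
  return m ? m[1] : null;
};

// Both the start and the end of line 1 are repeated later.
const both = "abcdefghij\nzz abcde\nzz fghij";
console.log(capture(greedy, both));   // → "fghij" (rightmost repeated run)
console.log(capture(lazy, both));     // → "abcde" (leftmost repeated run)
console.log(capture(anchored, both)); // → "abcde" (line start only)

// Only the end of line 1 is repeated later.
const endOnly = "abcdefghij\nzz fghij";
console.log(capture(lazy, endOnly));     // → "fghij" (falls back to a later run)
console.log(capture(anchored, endOnly)); // → null (line start is not repeated)
```

The lazy form prefers the start of the line but still matches when only a later part repeats. The anchored form is stricter and selects a line only when its first characters reappear further down.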