I have a pipe-delimited list of phrases. I would like to remove sequential duplicates using a regex replace/substitution. For example:

dog|cat|cat woman|cat woman|dog|dog 
cat|cat|catman|catman|catman|cat woman|cat woman|dog|dogman|doggy

would be transformed into

dog|cat|cat woman|dog 
cat|catman|cat woman|dog|dogman|doggy

I am stuck. So far I have ((^|\|)([^\|]+))\1+ with a substitution of $1, but that clearly does not work, because the output is

dog|cat woman|cat woman|dog 
cat|catman|catman|cat woman|dogman|doggy
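
The behaviour can be reproduced with a small Python sketch. Python, the helper name attempt_dedupe, and the use of \1 in place of $1 are assumptions made for illustration only. Nothing in the pattern ties a match to a whole item, so |cat also matches the start of |cat woman, and |dog the start of |dogman.

```python
import re

# The attempted pattern: group 1 is an optional leading "|" plus an item.
ATTEMPT = re.compile(r"((^|\|)([^\|]+))\1+")

def attempt_dedupe(line: str) -> str:
    # r"\1" is Python's spelling of the $1 substitution.
    return ATTEMPT.sub(r"\1", line)

print(attempt_dedupe("dog|cat|cat woman|cat woman|dog|dog"))
print(attempt_dedupe("cat|cat|catman|catman|catman|cat woman|cat woman|dog|dogman|doggy"))
```

Run on the two sample lines (without the trailing space), this prints the same incorrect output as listed above.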

Thanks for your help

hwm.nem

1 Answer

You can set boundaries on the left and the right to prevent partial matches when using the capture group and the backreference.

If a lookbehind assertion is supported:

(?<![^|\n])([^|\n]+)(?:\|\1)+(?![^|\n])

The pattern matches:

  • (?<![^|\n]) Negative lookbehind: assert that the char directly to the left, if there is one, is | or a newline, so the match starts at an item boundary
  • ([^|\n]+) Capture group 1: match 1 or more chars other than | or a newline (excluding the newline prevents matching across lines)
  • (?:\|\1)+ Repeat 1 or more times: match | followed by the backreference to group 1
  • (?![^|\n]) Negative lookahead: assert that the char directly to the right, if there is one, is | or a newline, so the match ends at an item boundary

In the replacement, use capture group 1 (written as $1 or \1, depending on the tool).

Output

dog|cat|cat woman|dog
cat|catman|cat woman|dog|dogman|doggy
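
As a minimal sketch of how this could be applied from Python, whose re module supports the fixed-width lookbehind used here: the function name remove_sequential_duplicates and the sample variable are hypothetical, and \1 is Python's syntax for capture group 1.

```python
import re

# Pattern from the answer: an item bounded by "|", a newline, or the string
# edges, followed by one or more "|item" repeats of that same item.
DEDUPE = re.compile(r"(?<![^|\n])([^|\n]+)(?:\|\1)+(?![^|\n])")

def remove_sequential_duplicates(text: str) -> str:
    # Keep a single copy of each run by replacing it with capture group 1.
    return DEDUPE.sub(r"\1", text)

sample = (
    "dog|cat|cat woman|cat woman|dog|dog\n"
    "cat|cat|catman|catman|catman|cat woman|cat woman|dog|dogman|doggy"
)
print(remove_sequential_duplicates(sample))
```

Because both lookarounds require an item boundary, an input such as superman|man is left unchanged.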

With thanks to Casimir et Hippolyte for the great improvement.

The fourth bird
  • If you don't put something to check what is on the left, you will get wrong results with something like `superman|man`. Also, you have to repeat the delimiter too. – Casimir et Hippolyte Mar 17 '22 at 16:00
  • @CasimiretHippolyte I think the alternation in the repeating group would be better than the lookahead variant. – The fourth bird Mar 17 '22 at 16:06
  • I will write it like this: https://regex101.com/r/jXkybt/1 – Casimir et Hippolyte Mar 17 '22 at 16:19
  • @CasimiretHippolyte That is awesome! It also accounts for `cat|cat` https://regex101.com/r/ymwLDP/1 – The fourth bird Mar 17 '22 at 16:22
  • @CasimiretHippolyte Are you going to post that solution? – The fourth bird Mar 17 '22 at 16:25
  • No, because I'm too lazy. – Casimir et Hippolyte Mar 17 '22 at 16:27
  • @CasimiretHippolyte Lazy or not, you have some reaaally good regex skills :-) I will update it with your gem. – The fourth bird Mar 17 '22 at 16:29
  • First, thanks! Second, what if lookbehind is not supported? Sadly, the application I am using this in does not support lookahead/lookbehind! – hwm.nem Mar 17 '22 at 17:22
  • @hwm.nem: And this application is? One possibility: first replace each pipe with two pipes, then use [this substitution](https://regex101.com/r/3p9woK/1), then replace each two-pipe sequence back with a single pipe, and together these patterns will rule the galaxy as father and son. – Casimir et Hippolyte Mar 17 '22 at 19:07
  • @CasimiretHippolyte: The software application is very industry specific. You would not be aware of it. Because of your comment, though, I decided to make a simple python function that can be called from the application, since it supports lookaround, and then I used your first method. THANK YOU! – hwm.nem Mar 18 '22 at 14:06
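
The pipe-doubling workaround from the comments could be sketched roughly as follows. The linked substitution is not reproduced in the thread, so the pattern below is an assumption about one way to do it without lookaround, not necessarily the one behind the link. The function name dedupe_without_lookaround is hypothetical, Python is used only to demonstrate the three steps, and the sketch assumes a single line with no empty items (no || in the input).

```python
import re

# Once every "|" is doubled, each item has its own delimiter on both sides,
# so the boundaries can be consumed instead of asserted with lookaround.
# Group 1: left boundary, group 2: the item, group 3: right boundary.
NO_LOOKAROUND = re.compile(r"(^|\|)([^|\n]+)(?:\|\|\2)+(\||$)")

def dedupe_without_lookaround(line: str) -> str:
    doubled = line.replace("|", "||")                    # step 1: double the pipes
    collapsed = NO_LOOKAROUND.sub(r"\1\2\3", doubled)    # step 2: collapse each run
    return collapsed.replace("||", "|")                  # step 3: restore single pipes

print(dedupe_without_lookaround("dog|cat|cat woman|cat woman|dog|dog"))
print(dedupe_without_lookaround("cat|cat|catman|catman|catman|cat woman|cat woman|dog|dogman|doggy"))
```

The doubling matters because a consumed boundary pipe cannot be reused by the next match. With two pipes between items, one run can take its trailing pipe and the following run still has a leading pipe of its own.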