Remove duplicate words in a line with sed

Question

Purely academic, but it's frustrating me.

I want to correct this text:

there there are are multiple lexical errors in this line line

using sed. I've got this far:

sed 's/\([a-z][a-z]*[ ,\n][ ,\n]*\)\1/\1/g' < file.text

It corrects everything except the final doubled-up word!

there are multiple lexical errors in this line line

Can a sed guru please explain why the above doesn't deal with the words at the end?

N.B. Regarding `[ ,\n]`: sed uses `\n` as the line delimiter, so unless you insert a `\n` into the pattern space yourself, you will never encounter one after a line has been read into it. - potong, May 16 '12 at 23:39
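
A quick way to see this, as a minimal sketch that should work with any POSIX-style sed: a substitution on `\n` does nothing while sed reads line by line, and only matches once `N` has pulled the next line into the pattern space. The variable names and the two-line input are illustrative only.

```shell
# Two-line input; sed strips each trailing newline before filling the pattern space.
input='line
line'

# No newline is ever present in the pattern space, so nothing is replaced.
plain=$(printf '%s\n' "$input" | sed -e 's/\n/ /')

# N appends the next input line, joined by an embedded newline that s/// can match.
joined=$(printf '%s\n' "$input" | sed -e 'N' -e 's/\n/ /')

printf '%s\n' "$plain"    # prints the two lines unchanged
printf '%s\n' "$joined"   # prints: line line
```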

codaddict · Answer 1 · 2012-05-15T12:03:38.400

11

This is because in the last case (line line) capture group 1 holds `line ` (line followed by a space), and the regex then looks for an exact repetition of that text. Since there is no space after the final line, the match fails.

To fix this, add a space after the final word, line.
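
The trailing-space explanation can be checked directly. The sketch below is an illustration rather than part of the original answer: it reuses the question's regex (with `[ ,]` in place of `[ ,\n]`, since no newline reaches the pattern space here), pads each line with a temporary trailing space, and strips that space again afterwards. The variable names are hypothetical.

```shell
text='there there are are multiple lexical errors in this line line'

# The question's substitution: the captured group includes the trailing separator.
dedupe='s/\([a-z][a-z]*[ ,][ ,]*\)\1/\1/g'

# Without padding, the last "line" has no separator after it, so it survives.
before=$(printf '%s\n' "$text" | sed -e "$dedupe")

# Pad the end of the line, de-duplicate, then remove the padding again.
after=$(printf '%s\n' "$text" | sed -e 's/$/ /' -e "$dedupe" -e 's/ $//')

printf '%s\n' "$before"   # prints: there are multiple lexical errors in this line line
printf '%s\n' "$after"    # prints: there are multiple lexical errors in this line
```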

Alternatively you can change the regex to:

sed -e 's/\b\([a-z]\+\)[ ,\n]\1/\1/g'
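
As a sketch of how this behaves, assuming GNU sed (`\b` and `\+` are GNU extensions): the command removes the final duplicate too, but because nothing anchors the end of the back-reference, a word followed by a longer word that starts with it (such as the then) is also collapsed. Appending a second `\b` is one possible tightening; that variant is a suggestion here, not part of the original answer.

```shell
text='there there are are multiple lexical errors in this line line'

# The answer's command, applied to the sample line from the question.
fixed=$(printf '%s\n' "$text" | sed -e 's/\b\([a-z]\+\)[ ,\n]\1/\1/g')

# Caveat: "the" is a prefix of "then", so the unanchored back-reference matches it.
loose=$(printf '%s\n' 'the then' | sed -e 's/\b\([a-z]\+\)[ ,\n]\1/\1/g')

# Hypothetical stricter variant: a trailing \b requires the repeat to be a whole word.
strict=$(printf '%s\n' 'the then' | sed -e 's/\b\([a-z]\+\)[ ,\n]\1\b/\1/g')

printf '%s\n' "$fixed"    # prints: there are multiple lexical errors in this line
printf '%s\n' "$loose"    # prints: then
printf '%s\n' "$strict"   # prints: the then
```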


edited May 15 '12 at 12:03

answered May 15 '12 at 11:58
