
Purely academic, but it's frustrating me.

I want to correct this text:

there there are are multiple lexical errors in this line line

using sed. I've got this far:

sed 's/\([a-z][a-z]*[ ,\n][ ,\n]*\)\1/\1/g' < file.text

It corrects everything except the final doubled-up word:

there are multiple lexical errors in this line line

Can a sed guru please explain why the above doesn't deal with the words at the end?

benjwy
N.B. regarding `[ ,\n]`: sed uses `\n` as its line delimiter, so unless you insert `\n`s into the pattern space yourself, you will never encounter one after a line has been read into the pattern space. – potong May 16 '12 at 23:39
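A small sketch of the point made in this comment, assuming a POSIX-style sed: each input line reaches the pattern space without its trailing newline, so `\n` only becomes matchable once a command such as `N` appends another line.

```shell
# Each line is processed on its own, so there is no newline to replace
# and both lines come out unchanged.
printf 'line\nline\n' | sed 's/\n/ /'

# N appends the next input line to the pattern space, joined by an
# embedded newline, which the substitution can now see and replace.
printf 'line\nline\n' | sed 'N;s/\n/ /'
```

The first command prints the two lines as they were; the second prints the single line `line line`.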

1 Answer


This is because in the last case your regex's capture group 1 holds `line ` (the word `line` followed by a space), and you are searching for a repetition of exactly that. Since there is no space after the final `line`, the match fails.

To fix this, add a space after the final word `line`.
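The explanation above can be checked with a shortened input. This is only a sketch: the variable name `orig` is made up for illustration, and `\n` has been dropped from the bracket expressions because, within a single line, a newline never appears in the pattern space.

```shell
# The question's expression: the separator sits inside the capture group,
# so the back-reference \1 demands the word AND its trailing separator again.
orig='s/\([a-z][a-z]*[ ,][ ,]*\)\1/\1/g'

# No trailing space: group 1 would be "line ", but only "line" follows,
# so the final pair is left untouched.
printf '%s\n' 'this line line' | sed "$orig"

# Trailing space added: "line " is now repeated in full and collapses.
printf '%s\n' 'this line line ' | sed "$orig"
```

The first command prints `this line line` unchanged; the second prints `this line ` (with the trailing space kept).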

Alternatively you can change the regex to:

sed -e 's/\b\([a-z]\+\)[ ,\n]\1/\1/g'
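A quick check of this expression, assuming GNU sed (`\b` and `\+` are GNU extensions and may not work in other implementations). The variable names `answer` and `strict` are made up for illustration, and `strict` is a hypothetical variant that is not part of the original answer: it adds a trailing `\b` so that a word followed by a longer word sharing its prefix is not treated as a duplicate.

```shell
# The expression from the answer: the separator is outside the capture group,
# so \1 only has to repeat the word itself.
answer='s/\b\([a-z]\+\)[ ,\n]\1/\1/g'

# Hypothetical variant: the closing \b requires the repeat to be a whole word.
strict='s/\b\([a-z]\+\)[ ,]\1\b/\1/g'

# All three doubled words collapse, including the one at the end of the line.
printf '%s\n' 'there there are are multiple lexical errors in this line line' | sed "$answer"

# Without the closing \b, "the" also matches the start of "theme".
printf '%s\n' 'the theme' | sed "$answer"

# With the closing \b, the same input is left alone.
printf '%s\n' 'the theme' | sed "$strict"
```

Under GNU sed the three commands print `there are multiple lexical errors in this line`, `theme`, and `the theme` respectively.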


codaddict