The tool sed is not really designed for this kind of work. sed has only two forms of memory, the pattern space and the hold space, which are nothing more than two simple strings it can remember. Every time you operate on one of these buffers, you have to rewrite the whole buffer and re-analyze it. Awk, on the other hand, is more flexible here and makes it easier to manipulate the lines in question.
awk '{delete s}
{for(i=1;i<=NF;++i) if(!(s[$i]++)) printf (i==1?"":OFS)"%s",$i}
{printf ORS}' file
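To see what the command above does, here is a minimal sketch that feeds it two made-up lines. The wrapper name dedupe_words and the sample text are assumptions for illustration only, and the printf call is spelled slightly more defensively than in the one-liner (an explicit format string), which should behave the same way.

```shell
# dedupe_words is a hypothetical wrapper name, used only for this demo.
dedupe_words() {
  # s[] records the words already seen on the current line;
  # it is emptied at the start of every record.
  awk '{delete s}
  {for(i=1;i<=NF;++i) if(!(s[$i]++)) printf "%s%s", (i==1?"":OFS), $i}
  {printf "%s", ORS}' "$@"
}

# Made-up sample input with LF line endings.
printf 'foo bar foo\nthe cat the dog cat\n' | dedupe_words
# prints:
#   foo bar
#   the cat dog
```

Only the first occurrence of each word on a line is kept, and the order of the remaining words is preserved.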
But since you work on a Windows machine, your files most likely have CRLF line endings. This can cause a problem with the last word on each line. If the line reads:
foo bar foo
awk would read it as
foo bar foo\r
and thus the last foo will not match the first foo due to the CR.
A corrected version would read:
awk 'BEGIN{RS=ORS="\r\n"}
{delete s}
{for(i=1;i<=NF;++i) if(!(s[$i]++)) printf (i==1?"":OFS)"%s",$i}
{printf ORS}' file
This works because you use Cygwin, which ships GNU awk, so we can rely on the GNU extension that allows RS to be a regex or a multi-character value.
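The effect of the CR can be made visible with a small experiment. The sketch below is only an illustration: dedupe_naive and dedupe_crlf are hypothetical wrapper names, the input is made up, and the second function assumes an awk that accepts a multi-character RS, as GNU awk does.

```shell
# Both function names are hypothetical and exist only for this demo.
dedupe_naive() {
  # Default RS: the record ends at LF, so the CR stays glued to the last word.
  awk '{delete s}
  {for(i=1;i<=NF;++i) if(!(s[$i]++)) printf "%s%s", (i==1?"":OFS), $i}
  {printf "%s", ORS}' "$@"
}

dedupe_crlf() {
  # Assumes an awk that accepts a multi-character RS (GNU awk does).
  awk 'BEGIN{RS=ORS="\r\n"}
  {delete s}
  {for(i=1;i<=NF;++i) if(!(s[$i]++)) printf "%s%s", (i==1?"":OFS), $i}
  {printf "%s", ORS}' "$@"
}

# od -c shows the carriage return as \r.
printf 'foo bar foo\r\n' | dedupe_naive | od -c   # third foo survives, followed by \r
printf 'foo bar foo\r\n' | dedupe_crlf  | od -c   # only "foo bar", followed by \r \n
```

With the default record separator the last field is foo plus a CR, which is a different string from foo, so it is not recognised as a duplicate. Setting RS to CRLF removes the CR from the record, and setting ORS to the same value writes the Windows line ending back out.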
If you want the comparison to be case-insensitive, you can replace s[$i] with s[tolower($i)].
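As a rough illustration of using tolower($i) as the array key, the following sketch treats Foo and FOO as the same word. The function name dedupe_nocase and the sample line are assumptions; LF line endings are used to keep the example short.

```shell
# dedupe_nocase is a hypothetical name for this demo.
dedupe_nocase() {
  # The lower-cased word is the lookup key; the word is printed as it was written.
  awk '{delete s}
  {for(i=1;i<=NF;++i) if(!(s[tolower($i)]++)) printf "%s%s", (i==1?"":OFS), $i}
  {printf "%s", ORS}' "$@"
}

printf 'Foo bar FOO Bar baz\n' | dedupe_nocase
# prints: Foo bar baz
```

The first spelling encountered is the one that is kept.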
There are still issues with sentences like
"There was a horse in the bar, it ran out of the bar."
The word bar should be matched here, but the trailing , and . make the two fields differ (bar, versus bar.), so they do not match. This can be solved with:
awk 'BEGIN{RS=ORS="\r\n"; ere="[,.?:;\042\047]"}
{delete s}
{for(i=1;i<=NF;++i) {
key=tolower($i); sub("^" ere,"",key); sub(ere "$","",key)
if(!(s[key]++)) printf (i==1?"":OFS)"%s",$i
}
}
{printf ORS}' file
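This essentially does the same, but removes one punctuation mark from the beginning and one from the end of a word before comparing it. The stripping only affects the lookup key; a word that is kept is still printed as it appeared in the input. The punctuation marks are listed in ere, where \042 and \047 are the octal escapes for the double quote and the single quote.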
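Running the example sentence through a sketch of this approach shows both the benefit and a side effect. The function name dedupe_punct is hypothetical, the printf calls use an explicit format string, and an awk that accepts a multi-character RS (such as GNU awk) is assumed.

```shell
# dedupe_punct is a hypothetical name for this demo.
dedupe_punct() {
  awk 'BEGIN{RS=ORS="\r\n"; ere="[,.?:;\042\047]"}
  {delete s}
  {for(i=1;i<=NF;++i) {
     # Build the comparison key: lower case, one leading and one trailing mark removed.
     key=tolower($i); sub("^" ere,"",key); sub(ere "$","",key)
     if(!(s[key]++)) printf "%s%s", (i==1?"":OFS), $i
   }
  }
  {printf "%s", ORS}' "$@"
}

# CRLF-terminated sample line; tr removes the CR so the result displays cleanly.
printf 'There was a horse in the bar, it ran out of the bar.\r\n' | dedupe_punct | tr -d '\r'
# prints: There was a horse in the bar, it ran out of
```

The second "the" and the final "bar." are both dropped. Because a duplicate word is removed together with the punctuation attached to it, the full stop at the end of the sentence disappears as well.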