
I'm trying to remove repeated words in a text. The same issue is described in these questions: Remove duplicate words in a line with sed and here: Removing duplicate strings with SED. But those variants do not work for me, maybe because I'm using GnuWin32.

Example what result I need:

Input

One two three bird animal two bird

Output

One two three bird animal
Yura Kosyak
  • Why the downvotes and the votes to close? Just try to do the same thing I asked with SED in GNU and you'll understand that this is a new question without any working answers! – Yura Kosyak Dec 09 '19 at 10:30

4 Answers


I think this would be far faster in awk.

This should work on any platform, but I have not verified it on Windows:

awk '{
  sp = "";
  delete seen;
  for (i=1; i<=NF; i++) if (!seen[$i]++) { printf "%s%s", sp, $i; sp = " "; }
  printf "\n";
}' file

(Feel free to condense that onto one line, it'll work fine.)

AWK excels at columnar data. By default, it divides each line's text into fields separated by contiguous white space (so given hello world, we get $1 = "hello" and $2 = "world"). The special NF variable holds the number of fields it found, so for (i=1; i<=NF; i++) iterates over each field (word); field number i has the value $i.
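
For example, a quick check of the field splitting (my own illustration, not part of the original answer):

$ echo "hello   world" | awk '{ print NF, $1, $2 }'
2 hello world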

I'm using an associative array here (aka a dictionary or hash). The seen array at index $i (the current word) starts as zero (uninitialized). We increment it, but just like C, awk uses x++ to increment x but return its original value (contrast to ++x which increments and returns the incremented value). Therefore, !seen[$i]++ is true (!0) when we haven't yet incremented the array at this word—it is new to us. seen is cleared at each line so we have unique words per line rather than across the whole file.
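
To see the post-increment test in isolation (a minimal sketch I added; the array name seen is the same as above):

$ echo "a b a" | awk '{ for (i = 1; i <= NF; i++) print $i, seen[$i]++ }'
a 0
b 0
a 1

Only the repeated a prints a nonzero count, so !seen[$i]++ is false only for it.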

Knowing that we haven't seen it, we need to print it. Note, the original white space between words is lost (it's not stored anywhere). We just print a space (but not at the beginning of a new line, thus the sp variable) and then the new word.

After the for loop, we complete the line. There will never be any trailing spaces. (Also, the actual line ending is lost, so we're assuming it's \n. If you want DOS line endings, use \r\n.)
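
A hedged example run (the file name words.txt is my own invention; the input is the sample from the question):

$ printf 'One two three bird animal two bird\n' > words.txt
$ awk '{
  sp = ""; delete seen;
  for (i=1; i<=NF; i++) if (!seen[$i]++) { printf "%s%s", sp, $i; sp = " "; }
  printf "\n";
}' words.txt
One two three bird animal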

Adam Katz

The tool sed is not really designed for this kind of work. sed has only two forms of memory, the pattern space and the hold space, which are nothing more than two simple strings it can remember. Every time you operate on such a memory block, you have to rewrite the full block and reanalyze it. Awk, on the other hand, has a bit more flexibility here and makes it easier to manipulate the lines in question.

awk '{delete s}
     {for(i=1;i<=NF;++i) if(!(s[$i]++)) printf (i==1?"":OFS)"%s",$i}
     {printf ORS}' file

But since you work on a Windows machine, your files most likely have CRLF line endings. This can cause a subtle problem with the last word on a line. If the line reads:

foo bar foo

awk would read it as

foo bar foo\r

and thus the last foo will not match the first foo due to the CR.
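
You can make the stray carriage return visible with cat -A (my own check, assuming GNU coreutils):

$ printf 'foo bar foo\r\n' |
  awk '{delete s}
       {for(i=1;i<=NF;++i) if(!(s[$i]++)) printf (i==1?"":OFS)"%s",$i}
       {printf ORS}' | cat -A
foo bar foo^M$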

A correction would now read:

awk 'BEGIN{RS=ORS="\r\n"}
     {delete s}
     {for(i=1;i<=NF;++i) if(!(s[$i]++)) printf (i==1?"":OFS)"%s",$i}
     {printf ORS}' file

This works since you use GnuWin32, which provides the GNU tools, so we can use the GNU extension that allows RS to be a regex or a multi-character value.

If you want case-insensitivity you can replace s[$i] with s[tolower($i)].

There are still issues with sentences like

"There was a horse in the bar, it ran out of the bar."

The word bar is repeated here, but the , and . attached to it prevent the two fields from matching. This can be solved with:

awk 'BEGIN{RS=ORS="\r\n"; ere="[,.?:;\042\047]"}
     {delete s}
     {for(i=1;i<=NF;++i) {
        key=tolower($i); sub("^" ere,"",key); sub(ere "$","",key)
        if(!(s[key]++)) printf (i==1?"":OFS)"%s",$i
      } 
     }
     {printf ORS}' file

This essentially does the same, but strips punctuation marks from the beginning and end of each word before comparing. The punctuation marks are listed in ere.
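
A hedged test run on the example sentence, feeding CRLF-terminated input the way a Windows file would look (this assumes GNU awk, since a multi-character RS is a GNU extension):

$ printf 'There was a horse in the bar, it ran out of the bar.\r\n' |
  awk 'BEGIN{RS=ORS="\r\n"; ere="[,.?:;\042\047]"}
       {delete s}
       {for(i=1;i<=NF;++i) {
          key=tolower($i); sub("^" ere,"",key); sub(ere "$","",key)
          if(!(s[key]++)) printf (i==1?"":OFS)"%s",$i
        }
       }
       {printf ORS}'
There was a horse in the bar, it ran out of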

kvantour
  • With a line like "00:00:02.170 --> 00:00:06.915 foo bar foo foo bar", this awk is correctly removing the duplicate foo. It's also removing the second timestamp; how can I prevent this? @kvantour new output being "00:00:02.170 foo bar foo bar" – ladyskynet Jan 21 '20 at 16:12
  • @ladyskynet I am unable to reproduce this. Can you show me that line exactly using `cat -vET`? – kvantour Jan 21 '20 at 16:21
  • command I was testing was: echo "00:00:02.170 --> 00:00:06.915 Forward I I mean, I will be" | awk '{delete s} {for(i=1;i<=NF;++i) { key=tolower($i); sub(/^[^a-z]*/,"",key); sub(/[^a-z]*$/,"",key) if(!(s[key]++)) printf (i==1?"":OFS)"%s",$i } } {printf ORS}' with output being: 00:00:02.170 Forward I mean, will be – ladyskynet Jan 21 '20 at 19:11
  • cat -vet of the line gave me this: 00:00:02.170 --> 00:00:06.915^M$ Forward I I mean, I will be^M$ @kvantour – ladyskynet Jan 21 '20 at 19:35
  • @ladyskynet, I can reproduce this. The reason for this is that we remove all non-alphabetic characters. This includes the numbers. Hence, the strings `00:00:02.170` and `00:00:06.915` are equivalent. If you don't want this, you have to update the sub-command with `[^a-z0-9]`. But now that I look at this, this is not a good way. Also, the last method in the post was assumed to work on sentences, not log-related strings. But this can be updated. Let me check this for a second. – kvantour Jan 21 '20 at 19:39
  • @ladyskynet I have updated the code, it should now work as intended. (Don't forget the RS definition on a Windows machine; it is important.) – kvantour Jan 21 '20 at 19:45
  • Thank you SO MUCH! @kvantour everything works as intended. – ladyskynet Jan 22 '20 at 00:46

This might work for you (GNU sed):

sed -E ':a;s/\<((\S+)\>.*)\s\<\2\>/\1/gi;ta' file

Match any word and remove any later duplicate of it together with its preceding white space. Repeat until no more substitutions are made.
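
For example, with the sample input from the question (my own test with GNU sed):

$ echo 'One two three bird animal two bird' | sed -E ':a;s/\<((\S+)\>.*)\s\<\2\>/\1/gi;ta'
One two three bird animal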

N.B. The regexp removes duplicates without regard to case. If you want to treat One separately from one, use:

sed -E ':a;s/\<((\S+)\>.*)\s\<\2\>/\1/g;ta' file
potong
  • The code is correct with small files but too slow with larger ones. I tried to use this code, but the program has been running for over 20 minutes and is still going without any results. I have a file with 60,000+ words, 310 KB in size – Yura Kosyak Dec 10 '19 at 17:02

For unique words that may include -- - / ' etc. (where \< and \> would break the 'word', such as an option in a kernel command line):

  1. Pad the input string with a space before and after, " $string " below
  2. string=$(sed -E ':a;s/(\s(\S+)\s.*)\2\s/\1/;ta' <<< " $string ")
  3. Remove the pads: string=${string# }; string=${string% } (the three steps are combined in the sketch below)
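
A hedged combined sketch (the example string is invented, and GNU sed is assumed for -E, \s and \S):

# combined sketch of the three steps above
string="quiet splash -- console=ttyS0 ro console=ttyS0"
string=$(sed -E ':a;s/(\s(\S+)\s.*)\2\s/\1/;ta' <<< " $string ")   # pad, then drop later duplicates
string=${string# }; string=${string% }                             # strip the pads again
echo "$string"    # quiet splash -- console=ttyS0 ro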
FGrose