0

I have multiple 2-column, tab-separated files of different lengths, and I want to eliminate the column 2 values that are common to ALL of the files.

For example:

File 1:

9   1975
1518    a
5   a.m.
16  able
299 about
8   above
5   access

File 2:

6   a
6   abandoned
140 abby
37  able
388 about
17  above
6   accident

File 3:

5   10
8   99
23  1992
7   2002
29  237th
11  60s
8   77th
2175    a
5   a.m.
6   abandoned
32  able
370 about

File 4:

5   911
1699    a
19  able
311 about
21  above
6   abuse

The desired result is to have the items in Column 2 that are common to ALL files removed from each respective file, as follows:

File 1:

9   1975
5   a.m.
16  able
8   above
5   access

File 2:

6   abandoned
140 abby
37  able
17  above
6   accident

File 3:

5   10
8   99
23  1992
7   2002
29  237th
11  60s
8   77th
5   a.m.
6   abandoned
32  able

File 4:

5   911
19  able
21  above
6   abuse

Some of the standard methods for finding duplicate values do not work for this task because I am trying to find values that are duplicated across multiple files, so something like comm or sort/uniq is not valid here. Is there a certain type of awk or other type of recursive tool that I can use to achieve my desired result?

owwoow14
  • What have you got so far? Personally, I'd be thinking `perl` with a hash to detect dupes. – Sobrique Feb 20 '15 at 12:34
  • awk uses hashes too, no need for the p word. Can you have duplicate $2 values within a file? There's nothing recursive about your problem description so not sure what you meant by that last sentence asking for a `recursive tool`. – Ed Morton Feb 20 '15 at 14:20
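
To illustrate the hash-counting idea from the comments above, here is a rough two-step sketch (untested; the intermediate common_values file name and the _new suffix are just placeholders picked for the example). The first command counts how many of the files each column 2 value appears in and prints the values found in every file; the loop then rewrites each file without those values.

# count, at most once per file, how many files each $2 appears in,
# then print the values whose count equals the number of input files
awk '!seen[FILENAME,$2]++ { cnt[$2]++ }
     END { for (v in cnt) if (cnt[v] == ARGC-1) print v }
' file1 file2 file3 file4 > common_values

# rewrite each file, dropping lines whose $2 is on that list
for f in file1 file2 file3 file4; do
    awk -v list=common_values '
        BEGIN { while ((getline v < list) > 0) common[v] }
        !($2 in common)
    ' "$f" > "${f}_new"
done

The first answer below folds both passes into a single awk command instead.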

2 Answers

2

Something like this (untested) will work if you can't have duplicated $2s within a file:

awk '
FNR==1 {                                # first line of each input file, on either pass
    if (seen[FILENAME]++) {             # already saw this name: this is the 2nd pass
        firstPass = 0
        outfile = FILENAME "_new"       # filtered copy, e.g. file1_new
    }
    else {                              # 1st pass over this file
        firstPass = 1
        numFiles++
        ARGV[ARGC++] = FILENAME         # queue the file to be read a 2nd time
    }
}
firstPass { count[$2]++; next }              # pass 1: count how many files each $2 appears in
count[$2] != numFiles { print > outfile }    # pass 2: drop the $2 values found in every file
' file1 file2 file3 file4

If you can have duplicated $2s within a file it's a tweak to only increment count[$2] the first time $2 appears in each file, e.g.

firstPass { if (!seen[FILENAME,$2]++) count[$2]++; next }
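
Either way, the originals are left untouched and the filtered copies end up in file1_new, file2_new, etc. If you then want the filtered versions to replace the originals, a sketch (untested) you could run once you are happy with the results:

for f in file1 file2 file3 file4; do
    mv "${f}_new" "$f"    # overwrite each original with its filtered copy
done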
Ed Morton
  • Sounds very nice! I would think, though, that this may fail (I haven't tested) if a value appears twice in the same file and doesn't appear in another one: the count[$2] will be = numFiles... oh, wait, you just updated it. +1! – fedorqui Feb 20 '15 at 14:10
  • 1
  • The files should `not` have any duplicated $2 within each individual file. I just tried this script (on files without duplicated $2s within a file) and it worked great. Will do more extensive testing, but for the small files I have been trying it out on, it works perfectly. – owwoow14 Feb 20 '15 at 14:34
-1

I haven't tested this, but it should do the trick. It will create files with a ".new" extension.

awk '{a[$2]++;b[$2]=$0;c[$2]=FILENAME}
      END{
          for(i in a){if(a[i]==1)print b[i]>c[i]".new"}
      }' file1 file2 file3 file4
Vijay
  • I just tried it out and I got the following error: `awk: syntax error at source line 1` as well as `awk: illegal statement at source line 1` – owwoow14 Feb 20 '15 at 12:58
  • @owwoow14 that means you are using old, broken awk (/usr/bin/awk on Solaris). Never use that awk. If you're on Solaris use /usr/xpg4/bin/awk or nawk instead (or even better use/install gawk if possible). This script will not do what you want though, it will do something like: create multiple identical files (one for each $2 value that occurs just once) that all contain just the lines whose $2s only occur once across all input files, shuffled into no discernable order and with all $1s that share $2 values across all files discarded but for the last one from the last file. – Ed Morton Feb 20 '15 at 13:50
  • @EdMorton I am actually using OSx Yosemite and I have solved the syntax problem, in case anyone else wants information regarding this error. I have since solved the problem using the solution above, and there is no longer a syntax error. – owwoow14 Feb 20 '15 at 14:35
  • 1
  • Whatever OS you are on - AFAIK that specific error message only comes from old, broken awk so you need to get a different awk even if it's not reporting a syntax error for a given script and even if it appears to work for a given script. It is lacking in functionality and fundamentally broken in ways that will bite you later, if not now. – Ed Morton Feb 20 '15 at 14:39