Finding Set Complement in Unix

Question

Given these two files:

 $ cat A.txt     $ cat B.txt
    3           11
    5           1
    1           12
    2           3
    4           2

I want to find the lines that are in A but not in B. What is the Unix command for that?

I tried this, but it seems to fail:

comm -3 <(sort -n A.txt) <(sort -n B.txt) | sed 's/\t//g'

You may have good reason to use a Unix one-liner, but have you considered writing a Perl or Python script to do it? This may be quicker to write and easier to read and modify. Python has set-based operations built into the language, so in a few lines, you can achieve what you're trying to do here. — avpx, Jan 29 '10 at 05:16
@avpx: you're right. In Python, it's as simple as `''.join(set(open('A.txt')) - set(open('B.txt')))`. — Alok Singhal, Jan 29 '10 at 05:22
@Alok: That's a pretty good way to do it, certainly shorter than the one I wrote. Kudos. — avpx, Jan 29 '10 at 05:25

Alok Singhal · Accepted Answer · 2010-01-29T05:28:41.350

10

comm -2 -3 <(sort A.txt) <(sort B.txt)

should do what you want, if I understood you correctly.

Edit: Actually, comm needs the files to be sorted in lexicographical order (the order plain sort produces), so you don't want -n in your sort command. A numeric sort puts 4 before 112, while comm expects 112 before 4:

$ cat A.txt
1
4
112
$ cat B.txt
1
112
# Bad:
$ comm -2 -3 <(sort -n A.txt) <(sort -n B.txt)
4
comm: file 1 is not in sorted order
112
# OK:
$ comm -2 -3 <(sort A.txt) <(sort B.txt)
4
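The ordering requirement above can be made explicit by pinning sort and comm to the same locale, since both use the locale's collating order. This is a minimal sketch, assuming bash (for process substitution); the function name set_difference is made up for illustration and is not part of any tool:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the lines that occur in file $1 but not in file $2.
# LC_ALL=C makes sort and comm agree on plain byte order, so comm never
# sees input that it considers unsorted.
set_difference() {
    LC_ALL=C comm -23 <(LC_ALL=C sort "$1") <(LC_ALL=C sort "$2")
}
```

Called as `set_difference A.txt B.txt`, it prints the result in byte order rather than in the original order of A.txt.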

edited Jan 29 '10 at 05:28

answered Jan 29 '10 at 05:10

score 3 · Answer 2 · answered Jan 29 '10 at 05:29

You can try this:

$ awk 'FNR==NR{a[$0];next} (!($0 in a))' B.txt A.txt
5
4
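For readers unfamiliar with the FNR==NR idiom, here is the same one-liner spelled out with comments. It is only a sketch: the wrapper name awk_difference and the array name seen are made up for illustration, and the arguments are given in the order A then B:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the awk one-liner above.
# Usage: awk_difference A.txt B.txt  -> lines of A.txt that are absent from B.txt
awk_difference() {
    awk '
        # FNR == NR holds only while awk reads the first file it is given (B):
        # remember each of its lines as an array key, then move to the next line.
        FNR == NR { seen[$0]; next }
        # Second file (A): print every line that was never seen in B.
        !($0 in seen)
    ' "$2" "$1"
}
```

Unlike the comm approach, this keeps the lines in the order they appear in A.txt and needs no sorting. One caveat: if B.txt is empty, FNR == NR stays true while A.txt is read, so nothing is printed.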

ghostdog74

@ghostdog74: Strange, it gives a different result on my machine: 3, 5, 1, 2, 4. – neversaint Jan 29 '10 at 06:06
What OS are you running? Use nawk on Solaris. – ghostdog74 Jan 29 '10 at 06:52
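The thread does not say why every line of A was printed. One possible cause (an assumption, not confirmed here) is invisible differences such as leading or trailing spaces or Windows line endings, which make otherwise equal lines compare as different. A minimal sketch that strips such whitespace before comparing, assuming bash; the function names normalize and normalized_difference are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip leading and trailing whitespace
# (including a trailing carriage return) from every line of file $1.
normalize() {
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' "$1"
}

# Hypothetical helper: lines of $1 that are not in $2, compared after normalizing.
normalized_difference() {
    LC_ALL=C comm -23 <(normalize "$1" | LC_ALL=C sort) \
                      <(normalize "$2" | LC_ALL=C sort)
}
```

Running `cat -A A.txt` (GNU) or `od -c A.txt` first shows whether such characters are actually present.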

score 2 · Answer 3 · answered Dec 13 '11 at 23:25

Note that the awk solution works but keeps duplicate lines from A (those that aren't in B), whereas the Python solution de-duplicates the result.

Also note that comm doesn't compute a true set difference: if a line is repeated in A and appears fewer times in B, comm leaves the "extra" copies in the result:

$ cat A.txt 
120
121
122
122
$ cat B.txt 
121
122
121
$ comm -23 <(sort A.txt) <(sort B.txt)
120
122

If this behavior is undesired, use sort -u to remove duplicates (only the duplicates in A matter):

$ comm -23 <(sort -u A.txt) <(sort B.txt)
120
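The two behaviors discussed above can be combined into a duplicate-free difference. The sketch below shows two variants, assuming bash; the function names are made up for illustration. The first sorts its output, the second keeps the order of A:

```shell
#!/usr/bin/env bash
# Hypothetical helper: set difference with sorted, duplicate-free output.
# sort -u on B is not strictly needed, but it keeps the call symmetric.
sorted_set_difference() {
    LC_ALL=C comm -23 <(LC_ALL=C sort -u "$1") <(LC_ALL=C sort -u "$2")
}

# Hypothetical helper: set difference that keeps the order of A ($1)
# and prints each surviving line only once.
ordered_set_difference() {
    awk '
        # First file given to awk (B): record its lines.
        FNR == NR { in_b[$0]; next }
        # Second file (A): print a line if it is not in B and not printed before.
        !($0 in in_b) && !printed[$0]++
    ' "$2" "$1"
}
```

Both take the files as `A.txt B.txt`. The awk variant shares the usual caveat of the FNR == NR idiom: with an empty B.txt it prints nothing.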

Robert Massaioli · Answer 4 · 2015-02-08T00:00:21.487

I recently wrote a program called Setdown that performs set operations from the command line.

You describe the set operations in a definitions file, similar to what you would write in a Makefile:

someUnion: "file-1.txt" \/ "file-2.txt"
someIntersection: "file-1.txt" /\ "file-2.txt"
someDifference: someUnion - someIntersection

I personally don't recommend performing set operations with ad-hoc commands that were not built for the job. They don't work well when you need to do many set operations or when some operations depend on each other. Setdown, by contrast, lets you write set operations that depend on other set operations, as someDifference does above.

At any rate, I think that it's pretty cool and you should totally check it out.

Note: I think that Setdown is much better than comm simply because Setdown does not require you to sort your inputs correctly. Instead, Setdown sorts your inputs for you, and it uses an external sort, so it can handle massive files. I consider this a major benefit because I have lost count of the number of times I have forgotten to sort the files that I passed into comm.

score 1 · Answer 5 · answered Feb 23 '21 at 09:42

Here is another way to do it with join:

join -v1 <(sort A.txt) <(sort B.txt)

From the documentation on join:

‘-v file-number’ Print a line for each unpairable line in file file-number (either ‘1’ or ‘2’), instead of the normal output.
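One difference from comm is worth knowing: by default join pairs lines on their first whitespace-separated field, not on the whole line. For single-column files like the ones in the question the two commands agree, but for multi-field lines they can differ. A small sketch, assuming bash, with demo data made up for illustration:

```shell
#!/usr/bin/env bash
# Demo data (made up for illustration): the first lines of A and B share
# a first field but differ as whole lines.
tmp=$(mktemp -d)
printf '1 apple\n2 pear\n' > "$tmp/A.txt"
printf '1 banana\n' > "$tmp/B.txt"

# join pairs on the first field only, so "1 apple" counts as matched.
join_result=$(LC_ALL=C join -v1 <(LC_ALL=C sort "$tmp/A.txt") <(LC_ALL=C sort "$tmp/B.txt"))

# comm compares whole lines, so "1 apple" is reported as missing from B.
comm_result=$(LC_ALL=C comm -23 <(LC_ALL=C sort "$tmp/A.txt") <(LC_ALL=C sort "$tmp/B.txt"))

printf 'join: %s\n' "$join_result"
printf 'comm: %s\n' "$comm_result"
rm -rf "$tmp"
```

Here join reports only "2 pear", while comm reports both "1 apple" and "2 pear".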

tommy.carstensen

