Given these two files:

 $ cat A.txt     $ cat B.txt
 3               11
 5               1
 1               12
 2               3
 4               2

I want to find the lines that are in A but not in B. What is the Unix command for this?

I tried this, but it seems to fail:

comm -3 <(sort -n A.txt) <(sort -n B.txt) | sed 's/\t//g' 
neversaint
  • You may have good reason to use a Unix one-liner, but have you considered writing a Perl or Python script to do it? This may be quicker to write and easier to read and modify. Python has set-based operations built into the language, so in a few lines, you can achieve what you're trying to do here. – avpx Jan 29 '10 at 05:16
  • @avpx: you're right. In Python, it's as simple as `''.join(set(open('A.txt')) - set(open('B.txt')))`. – Alok Singhal Jan 29 '10 at 05:22
  • @Alok: That's a pretty good way to do it, certainly shorter than the one I wrote. Kudos. – avpx Jan 29 '10 at 05:25

5 Answers

comm -2 -3 <(sort A.txt) <(sort B.txt)

should do what you want, if I understood you correctly.

Edit: Actually, comm needs the files to be sorted in lexicographical order, so you don't want -n in your sort command:

$ cat A.txt
1
4
112
$ cat B.txt
1
112
# Bad:
$ comm -2 -3 <(sort -n A.txt) <(sort -n B.txt)
4
comm: file 1 is not in sorted order
112
# OK:
$ comm -2 -3 <(sort A.txt) <(sort B.txt)
4
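
For reference, here is a minimal sketch of the same command run against the sample data from the question. The scratch directory, the `*.sorted` file names, and the `only_in_a` variable are assumptions made for illustration; the sorted temporary files stand in for the `<(sort ...)` process substitutions so that the sketch also runs in a plain POSIX shell. Setting `LC_ALL=C` is an extra precaution, not part of the original command: it makes sort and comm use the same byte-wise ordering whatever the locale.

```shell
# Work in a scratch directory that is removed when the shell exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Sample data from the question.
printf '%s\n' 3 5 1 2 4 > "$workdir/A.txt"
printf '%s\n' 11 1 12 3 2 > "$workdir/B.txt"

# Plain sort (no -n), so the order matches what comm expects.
LC_ALL=C sort "$workdir/A.txt" > "$workdir/A.sorted"
LC_ALL=C sort "$workdir/B.txt" > "$workdir/B.sorted"

# -2 hides lines unique to B and -3 hides lines common to both,
# leaving only column 1: lines that are in A but not in B.
only_in_a=$(LC_ALL=C comm -23 "$workdir/A.sorted" "$workdir/B.sorted")
printf '%s\n' "$only_in_a"
```

With the question's data this prints 4 and 5, one per line.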
Alok Singhal

You can try this:

$ awk 'FNR==NR{a[$0];next} (!($0 in a))' B.txt A.txt
5
4
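
In case the one-liner is hard to read, here is the same awk program spread over several lines with comments. This is only a sketch: the array name `seen_in_b`, the scratch directory, and the `only_in_a` variable are made up for illustration, and the data is the sample from the question.

```shell
# Scratch directory for the sample files, removed when the shell exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' 3 5 1 2 4 > "$workdir/A.txt"
printf '%s\n' 11 1 12 3 2 > "$workdir/B.txt"

only_in_a=$(awk '
    # NR counts lines across all files, FNR restarts for each file,
    # so the two are equal only while reading the first file (B.txt).
    # Store each line of B as an array key, then move to the next line.
    FNR == NR { seen_in_b[$0]; next }

    # Second file (A.txt): print lines that were never stored.
    !($0 in seen_in_b)
' "$workdir/B.txt" "$workdir/A.txt")
printf '%s\n' "$only_in_a"
```

No sorting is needed, and the output keeps the order of A.txt (5, then 4). One caveat: if B.txt is empty, `FNR == NR` is also true for every line of A.txt, so nothing is printed.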
ghostdog74

Note that the awk solution works but keeps duplicate lines from A that are not in B, whereas the Python solution de-duplicates the result.

Also note that comm does not compute a true set difference: if a line appears more times in A than in B, comm leaves the "extra" copies in the result:

$ cat A.txt 
120
121
122
122
$ cat B.txt 
121
122
121
$ comm -23 <(sort A.txt) <(sort B.txt)
120
122

If this behavior is undesired, use sort -u to remove duplicates (only the duplicates in A matter):

$ comm -23 <(sort -u A.txt) <(sort B.txt)
120
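
To see the variants side by side, here is a rough sketch on made-up data (not the files above) in which A also repeats a line that B lacks. The variable names are illustrative, and the `!printed[$0]++` test is an addition to the awk answer, not part of it; it is one common way to let each line through only once.

```shell
# Scratch directory for the sample files, removed when the shell exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Made-up data: 120 and 122 each appear twice in A; 121 appears twice in B.
printf '%s\n' 120 120 121 122 122 > "$workdir/A.txt"
printf '%s\n' 121 122 121 > "$workdir/B.txt"

LC_ALL=C sort "$workdir/A.txt" > "$workdir/A.sorted"
LC_ALL=C sort -u "$workdir/A.txt" > "$workdir/A.unique"
LC_ALL=C sort "$workdir/B.txt" > "$workdir/B.sorted"

# comm pairs lines one-to-one, so the second 122 in A has no partner in B.
comm_plain=$(LC_ALL=C comm -23 "$workdir/A.sorted" "$workdir/B.sorted")

# De-duplicating A first gives a true set difference.
comm_unique=$(LC_ALL=C comm -23 "$workdir/A.unique" "$workdir/B.sorted")

# awk drops every line present in B but keeps repeats of the lines it prints.
awk_plain=$(awk 'FNR == NR { in_b[$0]; next } !($0 in in_b)' \
    "$workdir/B.txt" "$workdir/A.txt")

# The extra printed[$0]++ test lets each surviving line through only once.
awk_unique=$(awk 'FNR == NR { in_b[$0]; next } !($0 in in_b) && !printed[$0]++' \
    "$workdir/B.txt" "$workdir/A.txt")

# Unquoted expansions join each multi-line result onto one line for display.
echo "comm -23:          " $comm_plain
echo "comm -23, sort -u: " $comm_unique
echo "awk:               " $awk_plain
echo "awk, de-duplicated:" $awk_unique
```

On this data plain comm reports 120 120 122, plain awk reports 120 120, and both de-duplicating variants report just 120.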
sporobolus

I recently wrote a program called Setdown that performs set operations from the command line.

You describe the operations in a definition file, similar to what you would write in a Makefile:

someUnion: "file-1.txt" \/ "file-2.txt"
someIntersection: "file-1.txt" /\ "file-2.txt"
someDifference: someUnion - someIntersection

I personally don't recommend performing set operations with ad hoc commands that were not built for the job. That approach does not work well when you need many set operations, or when some of them depend on each other. Setdown lets you write set operations that depend on other set operations, as in the definition above.

At any rate, I think it's pretty cool and you should check it out.

Note: I think Setdown is much better than comm simply because it does not require you to sort your inputs correctly. Setdown sorts the inputs for you, and it uses an external sort, so it can handle massive files. I consider this a major benefit, because I have lost count of the times I forgot to sort the files I passed to comm.

Robert Massaioli

Here is another way to do it with join:

join -v1 <(sort A.txt) <(sort B.txt)

From the documentation on join:

‘-v file-number’ Print a line for each unpairable line in file file-number (either ‘1’ or ‘2’), instead of the normal output.
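
As a quick check, here is a sketch of the join command on the sample data from the question. The scratch directory, the `*.sorted` file names, and the `only_in_a` variable are assumptions made for illustration, and `LC_ALL=C` is added so that sort and join agree on the ordering. Keep in mind that join compares only the join field (the first whitespace-separated field by default), so this fits single-column files like the ones in the question.

```shell
# Scratch directory for the sample files, removed when the shell exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' 3 5 1 2 4 > "$workdir/A.txt"
printf '%s\n' 11 1 12 3 2 > "$workdir/B.txt"

# join, like comm, expects both inputs sorted on the join field.
LC_ALL=C sort "$workdir/A.txt" > "$workdir/A.sorted"
LC_ALL=C sort "$workdir/B.txt" > "$workdir/B.sorted"

# -v 1 prints only the lines of file 1 that have no match in file 2.
only_in_a=$(LC_ALL=C join -v 1 "$workdir/A.sorted" "$workdir/B.sorted")
printf '%s\n' "$only_in_a"
```

This prints 4 and 5, the same result as the comm approach.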

tommy.carstensen