diff gives all-different, but human compare shows equalities

Question

I have 2 files, a:

2       m1.small
1       m3.large
2       m3.medium
2       t1.micro
1       t2.large
7       t2.medium
4       t2.micro
7       t2.small

and b:

1       c4.2xlarge
1       c4.large
2       m1.small
1       m3.large
3       m3.medium
1       m4.large
3       t1.micro
3       t2.large
11      t2.medium
7       t2.micro
7       t2.small

When I use sdiff, every line is reported as different:

$ sdiff a b
2       m1.small           | 1       c4.2xlarge
1       m3.large           | 1       c4.large
2       m3.medium          | 2       m1.small
2       t1.micro           | 1       m3.large
1       t2.large           | 3       m3.medium
7       t2.medium          | 1       m4.large
4       t2.micro           | 3       t1.micro
7       t2.small           | 3       t2.large
                           > 11      t2.medium
                           > 7       t2.micro
                           > 7       t2.small

Whereas I can clearly see a match for at least:

2       m1.small
1       m3.large
7       t2.small

Why is this, and can I do anything to improve the result I get from diff?

I've also tried with meld (a Windows diff tool), and that gives me exactly the same result.

Do both files have the same encoding/line ending? I can replicate your result if one of the files has Windows line endings. — Sven, Mar 31 '17 at 12:27
They ought to, yes. I pasted them manually in vi. Running dos2unix on both yields the same diff result — ShadowFlame, Mar 31 '17 at 12:29

score 2 · Answer 1 · edited May 23 '17 at 12:41

Diff-type utilities compare files on a line-by-line basis, whereas you seem to be interested in whether lines are common to the two files.

The comm utility may be what you are looking for. The files will, however, need some preprocessing (swapping the field order and sorting):

awk '{ print $2 " " $1 }' a | sort > as
awk '{ print $2 " " $1 }' b | sort > bs

and then you can execute comm:

comm as bs

which gives output in 3 columns (lines present only in the left file, only in the right file, or in both):

        c4.2xlarge 1
        c4.large 1
                m1.small 2
                m3.large 1
m3.medium 2
        m3.medium 3
        m4.large 1
t1.micro 2
        t1.micro 3
t2.large 1
        t2.large 3
        t2.medium 11
t2.medium 7
t2.micro 4
        t2.micro 7
                t2.small 7

It's also possible to emit only the lines occurring in the left file alone (comm -2 -3), in the right file alone (comm -1 -3), or in both files (comm -1 -2).
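
As a minimal sketch of these column-suppressing flags, the following recreates a and b from the question's data in a scratch directory and writes each view to its own file. The scratch directory, the single-space separator, the LC_ALL=C setting and the output file names only_a, only_b and both are my own assumptions for the sake of a self-contained example, not part of the original commands.

```shell
# Work in a scratch directory so the example does not clobber real files.
cd "$(mktemp -d)"

# Sample data from the question: count, then instance type.
printf '%s %s\n' 2 m1.small 1 m3.large 2 m3.medium 2 t1.micro \
                 1 t2.large 7 t2.medium 4 t2.micro 7 t2.small > a
printf '%s %s\n' 1 c4.2xlarge 1 c4.large 2 m1.small 1 m3.large \
                 3 m3.medium 1 m4.large 3 t1.micro 3 t2.large \
                 11 t2.medium 7 t2.micro 7 t2.small > b

# comm expects both inputs sorted in the same collation order;
# forcing the C locale makes sort and comm agree.
export LC_ALL=C
awk '{ print $2 " " $1 }' a | sort > as
awk '{ print $2 " " $1 }' b | sort > bs

comm -23 as bs > only_a   # suppress columns 2 and 3: lines only in as
comm -13 as bs > only_b   # suppress columns 1 and 3: lines only in bs
comm -12 as bs > both     # suppress columns 1 and 2: lines in both files

cat both
```

With the question's data, both ends up holding m1.small 2, m3.large 1 and t2.small 7, which are the three matches spotted by eye.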

I think that's as close as one can get to the result you're after.
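
If a single row per instance type with both counts side by side would be more useful, join is a different tool that may be worth a try. This is only a sketch under a few assumptions of mine: the files are rebuilt in a scratch directory, the C locale is forced so that sort and join agree on ordering, 0 is used as the filler for a type missing from one file, and the output file name counts is arbitrary.

```shell
# Work in a scratch directory so the example does not clobber real files.
cd "$(mktemp -d)"
export LC_ALL=C

# Sample data from the question: count, then instance type.
printf '%s %s\n' 2 m1.small 1 m3.large 2 m3.medium 2 t1.micro \
                 1 t2.large 7 t2.medium 4 t2.micro 7 t2.small > a
printf '%s %s\n' 1 c4.2xlarge 1 c4.large 2 m1.small 1 m3.large \
                 3 m3.medium 1 m4.large 3 t1.micro 3 t2.large \
                 11 t2.medium 7 t2.micro 7 t2.small > b

# Same preprocessing as for comm: type first, then count, sorted.
awk '{ print $2 " " $1 }' a | sort > as
awk '{ print $2 " " $1 }' b | sort > bs

# -a 1 -a 2     also print types that appear in only one of the files
# -e 0          fill the missing count with 0 (takes effect together with -o)
# -o 0,1.2,2.2  print the join field, then the count from as, then from bs
join -a 1 -a 2 -e 0 -o 0,1.2,2.2 as bs > counts

cat counts
```

Each output line has the form "type count-in-a count-in-b", so unchanged types such as m1.small show the same number twice and types present only in b show 0 in the middle column.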

Actually this question seems pretty much the same as https://stackoverflow.com/questions/373810/unix-command-to-find-lines-common-in-two-files

An alternative to comm that uses awk to identify just the common lines can be found there. I reproduce it here because it's very elegant:

awk 'NR==FNR{arr[$0];next} $0 in arr' a b
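
For anyone unfamiliar with the NR==FNR idiom, here is the same one-liner spelled out with comments and run against the question's data. Treat it as a sketch: the scratch directory, the tab separator (the question does not say whether the columns are split by tabs or spaces), the array name seen and the output file name common are my own choices.

```shell
# Work in a scratch directory so the example does not clobber real files.
cd "$(mktemp -d)"

# Sample data from the question, tab-separated (an assumption).
printf '%s\t%s\n' 2 m1.small 1 m3.large 2 m3.medium 2 t1.micro \
                  1 t2.large 7 t2.medium 4 t2.micro 7 t2.small > a
printf '%s\t%s\n' 1 c4.2xlarge 1 c4.large 2 m1.small 1 m3.large \
                  3 m3.medium 1 m4.large 3 t1.micro 3 t2.large \
                  11 t2.medium 7 t2.micro 7 t2.small > b

awk '
    # NR counts lines across all inputs, FNR restarts at 1 for each file,
    # so NR == FNR is true only while the first file (a) is being read.
    NR == FNR { seen[$0]; next }   # remember each line of a, print nothing

    # Reached only for the second file (b): a pattern without an action
    # prints the line, here when the whole line also occurred in a.
    $0 in seen
' a b > common

cat common
```

The comparison is on the entire line, byte for byte, so a stray carriage return or a tab in one file where the other has spaces is enough to stop two visually identical lines from matching. No sorting is needed, and the output follows the order of the second file.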

+1 for teaching me about comm :D — ShadowFlame, Mar 31 '17 at 12:32
