bash: Difference between join and comm

Question

# comm -12 /tmp/src /tmp/txt | wc -l
  10338
# join /tmp/src /tmp/txt | wc -l
  10355

Both files are single columns of alphanumeric strings and both are sorted. Shouldn't the two counts be the same?

Updated following @Kevin's answer below:

cat /tmp/txt | sed 's/^[:space:]*//' > /tmp/stxt
cat /tmp/src | sed 's/^[:space:]*//' > /tmp/ssrc

and the result:

#join /tmp/ssrc /tmp/stxt | wc -l
516
# comm -12 /tmp/ssrc /tmp/stxt | wc -l
513

On manual inspection of the diffs, the remaining differences are due to some whitespace that the sed did not remove.

@keith Thompson It might be command specific - but I encountered them while choosing them for bash script. Hence the tag. — Tathagata, Aug 29 '11 at 18:35

Jonathan Leffler · Answer 1 · 2011-08-29T19:30:05.593

There are a couple of differences between comm and join:

comm compares whole lines; join compares fields within lines.
comm prints whole lines; join can print selected parts of lines.

When you have a single column of data in each file, there is relatively little difference. When you have multiple columns, there can be a lot of difference.
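A minimal sketch of the multi-column case, using two made-up files (the names `left` and `right` and their contents are only for illustration): the keys match on both lines, but only one whole line is identical.

```shell
# Hypothetical two-column files; the first column is the key.
work=$(mktemp -d)
printf 'k1 a\nk2 b\n' > "$work/left"
printf 'k1 a\nk2 c\n' > "$work/right"

# comm compares whole lines, so only the identical line "k1 a" is common.
comm_out=$(comm -12 "$work/left" "$work/right")
echo "$comm_out"    # k1 a

# join compares only the first field, so both keys match and the
# remaining fields of each file are appended.
join_out=$(join "$work/left" "$work/right")
echo "$join_out"    # k1 a a
                    # k2 b c

rm -rf "$work"
```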

Also note that under the right circumstances, join can output multiple copies of the data from one file while joining with different lines from the other file. This looks to me like your problem; you probably have some duplicate values in one of the files. Suppose you have:

src           txt
123           123
              123
              123

If you do comm -12 src txt, you will get one line of output; if you do join src txt, you will get three lines of output. This is expected.
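The duplicate case above can be reproduced with scratch files. This is only a sketch; the temporary directory and the `sort -u` step are my own additions, and deduplicating is appropriate only if you do not care how often a value occurs.

```shell
# Recreate the src/txt example in a scratch directory.
work=$(mktemp -d)
printf '123\n' > "$work/src"
printf '123\n123\n123\n' > "$work/txt"

# comm pairs lines one-to-one, so the two extra copies count as unique to txt.
comm_count=$(comm -12 "$work/src" "$work/txt" | wc -l | tr -d ' ')
# join pairs the single src line with every matching txt line.
join_count=$(join "$work/src" "$work/txt" | wc -l | tr -d ' ')
echo "comm: $comm_count, join: $join_count"    # comm: 1, join: 3

# Removing duplicates first brings the two counts back in line.
sort -u "$work/txt" > "$work/txt.uniq"
join_uniq_count=$(join "$work/src" "$work/txt.uniq" | wc -l | tr -d ' ')
echo "join after sort -u: $join_uniq_count"    # join after sort -u: 1

rm -rf "$work"
```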

The join command can also handle 'outer joins' where data is missing from the second file for a line in the first file (a LEFT OUTER JOIN in terms of SQL) or vice versa (a RIGHT OUTER JOIN), or both at once (a FULL OUTER JOIN).
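As a rough illustration, the outer joins map onto the `-a` option of join, which also prints unpairable lines from the named file. The file names and contents below are invented for the example.

```shell
# Hypothetical key/value files; key "a" exists only in left, "c" only in right.
work=$(mktemp -d)
printf 'a 1\nb 2\n' > "$work/left"
printf 'b x\nc y\n' > "$work/right"

# LEFT OUTER JOIN: also print unpairable lines from file 1.
left_outer=$(join -a 1 "$work/left" "$work/right")
echo "$left_outer"     # a 1
                       # b 2 x

# RIGHT OUTER JOIN: also print unpairable lines from file 2.
right_outer=$(join -a 2 "$work/left" "$work/right")
echo "$right_outer"    # b 2 x
                       # c y

# FULL OUTER JOIN: unpairable lines from both files.
full_outer=$(join -a 1 -a 2 "$work/left" "$work/right")
echo "$full_outer"     # a 1
                       # b 2 x
                       # c y

rm -rf "$work"
```

In the unpaired lines the missing columns are simply absent. If you need placeholders, look at the `-e` and `-o` options in your join man page.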

All-in-all, join is a more complex command, but it is attempting to do a more complex job. Both are useful; but they are useful in different places.

thanks for the answer, really informative. I generally `sort -k` on the column, but have never been comfortable using `join` - and find myself writing long strips of `awk` associative arrays to compare files .. lulz :D — Tathagata, Aug 29 '11 at 19:34

tripleee · Answer 2 · 2011-08-29T19:26:00.157

The main utility of join is to select lines which share one field, like you can do in a database. Say you have the following files:

File A
Alice  24
Bill   16
Claire 31
John   10
John  -14

File B
Bill   Copenhagen
John   Adelaide

... you can select the "John" and "Bill" lines from File A by giving File B as the file to join with, and the first field of both as the field to join on. The requirement that both files have to be sorted on that field is rather cumbersome in practice, though.
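Here is a sketch of that with the two files above saved as `A` and `B` in a scratch directory (the file names are my choice). Plain `join` appends the matching fields from File B; `-o` restricts the output to File A's columns if you only want the selected lines.

```shell
work=$(mktemp -d)
# File A and File B from above; both are already sorted on the first field.
printf '%s\n' 'Alice  24' 'Bill   16' 'Claire 31' 'John   10' 'John  -14' > "$work/A"
printf '%s\n' 'Bill   Copenhagen' 'John   Adelaide' > "$work/B"

# Default: join on field 1 of both files, print the key plus the rest of each line.
joined=$(join "$work/A" "$work/B")
echo "$joined"      # Bill 16 Copenhagen
                    # John 10 Adelaide
                    # John -14 Adelaide

# Print only fields 1 and 2 of File A for the matching lines.
selected=$(join -o 1.1,1.2 "$work/A" "$work/B")
echo "$selected"    # Bill 16
                    # John 10
                    # John -14

rm -rf "$work"
```

If the inputs are not already sorted on the join field, sort them first (for example with `sort -k1,1`) under the same locale you run join in.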

Kevin · Accepted Answer · 2011-08-29T19:52:55.147

~~I haven't used either extensively, but from a quick look at the man pages and test input, it seems that if the two files differ, comm prints both and join only prints matching lines.~~ The -12 took care of that. You could store the output of the two into files and do a diff to see how they differ.

$ echo -e '1\n2\n3\n5' > a
$ echo -e '1\n2\n4\n5' > b
$ comm a b
                1
                2
3
        4
                5
$ join a b
1
2
5
$
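For comparison, a small sketch of the diff suggestion using the same `a` and `b` contents with `comm -12` instead of plain `comm`. The scratch directory and output file names are assumptions of mine. With clean single-column input and no duplicates, the two outputs should be identical.

```shell
work=$(mktemp -d)
printf '1\n2\n3\n5\n' > "$work/a"
printf '1\n2\n4\n5\n' > "$work/b"

# Save both results so they can be compared.
comm -12 "$work/a" "$work/b" > "$work/comm.out"
join "$work/a" "$work/b" > "$work/join.out"

# diff exits 0 when the files are identical.
if diff "$work/comm.out" "$work/join.out" > /dev/null; then
    same=yes
else
    same=no
fi
echo "identical: $same"    # identical: yes

common=$(cat "$work/comm.out")
rm -rf "$work"
```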

Edit: By default, join compares only the first whitespace-separated field, whereas comm compares the whole line. Stray whitespace on a line (for example a trailing space) can therefore make the two outputs differ.
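A minimal sketch of that effect, assuming one file has a trailing space that the other lacks (file names invented for the example):

```shell
work=$(mktemp -d)
printf 'abc\n'  > "$work/clean"
printf 'abc \n' > "$work/padded"   # same key, but with a trailing space

# comm sees two different lines, so nothing is common.
comm_count=$(comm -12 "$work/clean" "$work/padded" | wc -l | tr -d ' ')
# join sees the same first field "abc" in both files, so the lines pair up.
join_count=$(join "$work/clean" "$work/padded" | wc -l | tr -d ' ')
echo "comm: $comm_count, join: $join_count"    # comm: 0, join: 1

rm -rf "$work"
```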

edited Aug 29 '11 at 19:52

answered Aug 29 '11 at 18:23


I'm using `comm -12` that suppress lines unique to FILE1, FILE2. the `diff`-s are too big - hurts eyes ;) – Tathagata Aug 29 '11 at 18:30

I see now I missed that. After further review of the man pages, it appears that join joins on the first whitespace-delimited field, but comm joins on the full line. Are there spaces in the input files? – Kevin Aug 29 '11 at 18:37
Good point ... I'll try to `sed` up the spaces and see if there is a difference ... :D – Tathagata Aug 29 '11 at 18:53
the problems were indeed due to rouge whitespace. Updated the question with the changes made to the file. The minor difference between the results that I have now, on manual inspection reveals are all because of whitespaces - wondering why `[:space:]` didn't take them down. Anyregex, can you please update your answer so that I can accept it? And mega thanks :D – Tathagata Aug 29 '11 at 19:27
I understood that rouge is French for red, but I don't know what the -s to diff does. – user2066657 Dec 12 '18 at 18:30
s/rouge/rogue ... :P – Tathagata Apr 09 '20 at 13:10

score 1 · Answer 4 · answered Aug 30 '11 at 11:23

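Use [[:space:]] (instead of [:space:]) to strip whitespace with sed. A character class such as [:space:] is only recognized inside a bracket expression. On its own, [:space:] is read as a bracket expression matching the literal characters ':', 's', 'p', 'a', 'c' and 'e', so it does not match spaces at all (some sed implementations reject it with an error instead).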

# compare
{
echo '   abc' | sed 's/^[:space:]*//'
echo '   abc' | sed 's/^[[:space:]]*//'
}
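Building on that, a sketch of how the cleanup from the question might look with the corrected class, stripping both leading and trailing whitespace. The sample data and file names are made up. Re-sorting afterwards is my own precaution: removing leading whitespace can change the sort order, and both comm and join expect sorted input.

```shell
work=$(mktemp -d)
# Hypothetical raw data with leading spaces, a trailing space and a trailing tab.
printf '  pear\n apple \nfig\t\n' > "$work/raw"

# Strip leading and trailing whitespace, then sort again.
sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$work/raw" | sort > "$work/clean"

clean=$(cat "$work/clean")
echo "$clean"    # apple
                 # fig
                 # pear

rm -rf "$work"
```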

jon

