1

I have file1.csv that looks like this:

99495
123346
1132013

and file2.csv that looks like this:

99495,141
99495,138
123346,1

Both files have been sorted using:

sort -t, -k1n file1_unsorted.csv > file1.csv
sort -t, -k1n file2_unsorted.csv > file2.csv

I tried sorting using -k1g also, but with no difference in final result.

When I join the two on the first field of both CSVs,

join -t, -o 2.1 2.2 -1 1 -2 1 file1.csv file2.csv > joined.csv

I get the errors:

join: file1.csv:2: is not sorted: 123346
join: file2.csv:3: is not sorted: 123346,1

To me it seems like a problem of the 'sort' program sorting by field whereas the 'join' program trying to join while checking the sort on the first digit. Since 1 is smaller than 9, it says that the numbers are not sorted. I am guessing that padding the numbers with 0 will work but I don't want to do that. I want to be able to get join to recognize that sorting has been done by the whole field.

Please note that when I tried the above minimum working examples, the join did occur even though sort -c throws errors. But when I tried it on my (large) actual datasets, I get the errors I post above.

I am using version 8.28 of both join and sort on Ubuntu WSL on Windows 10.

Mishal Ahmed
  • 191
  • 2
  • 11

1 Answers1

1

First you need to sort the files lexicographically to join them, then sort numerically after joining:

join -t, <(sort -b unsorted1.csv) <(sort -b unsorted2.csv) | LC_ALL=C sort -n

The LC_ALL=C may be necessary because the , is used as decimal point character in some locales and may be used as digit grouping character on others.

M. Nejat Aydin
  • 9,597
  • 1
  • 7
  • 17
  • Sorting by -b is the only way it worked. Thanks. But now I wonder why sorting numerically (n) or generally (g) does NOT lead to issues in the minimum working example. – Mishal Ahmed Aug 28 '20 at 22:07
  • @MishalAhmed I believe it "works" by chance, because there are too few lines in your sample files. – M. Nejat Aydin Aug 28 '20 at 22:47