
I have two csv files, a.csv and b.csv. Both of them come with no headers, and each value in a row is separated by \t.

a.csv:

1   apple
2   banana
3   orange
4   pear

b.csv:

apple   0.89
banana  0.57
cherry  0.34

I want to "subtract" these two files, i.e. get the values in the second column of a.csv that do not appear in the first column of b.csv, something like a.csv[1] - b.csv[0]. That would give me another file c.csv that looks like

orange
pear

Instead of using Python or other programming languages, I want to use a bash command to complete this task. I found out that awk would be helpful, but I am not sure how to write the correct command. Here is another similar question, but the second answer uses awk '{print $2,$6-$13}' to get the numeric difference between values rather than the difference in occurrences.

Thanks, and I appreciate any help.

Inian
FrancisYL
  • Are you sure your expected output is fully shown? I guess you missed `cherry`? – Zelnes Aug 07 '19 at 07:24
  • @Zelnes: OP wants entries (second column) in the first file which are not present in the second file (first column) – Inian Aug 07 '19 at 07:28

3 Answers


You can easily do this with a small tweak to Steve's answer from the link you are referring to. I'm not sure the other answer using paste will solve this problem.

Create a hash map from the second file, b.csv, and compare it against the 2nd column in a.csv:

awk -v FS="\t" 'BEGIN { OFS = FS } FNR == NR { unique[$1]; next } !($2 in unique) { print $2 }' b.csv a.csv

To redirect the output to a new file, append > c.csv at the end of the previous command.

Set the field separators (input and output) to \t since you are reading a tab-delimited file.

  1. The FNR == NR { action; next } { action } f1 f2 pattern is a general construct you will find in many awk commands that need to act on more than one file. The block guarded by FNR == NR is executed for the first file argument, and the next block within {..} runs for the second file argument.
  2. The part unique[$1]; next creates a hash map unique whose keys are the values in the first column of b.csv. This block runs for every line of that file.
  3. Once that file is completely processed, on the next file a.csv we evaluate !($2 in unique), which selects those lines whose $2 is not a key in the unique hash map built from the first file.
  4. On those lines, print only the second column: { print $2 }
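To see the whole thing in action, here is a quick run against the sample data from the question (the files are recreated with printf here, assuming the tab layout shown above):

```shell
# Recreate the sample tab-separated files from the question.
printf '1\tapple\n2\tbanana\n3\torange\n4\tpear\n' > a.csv
printf 'apple\t0.89\nbanana\t0.57\ncherry\t0.34\n' > b.csv

# Build the hash map from b.csv, then filter a.csv and write c.csv.
awk -v FS="\t" 'BEGIN { OFS = FS } FNR == NR { unique[$1]; next } !($2 in unique) { print $2 }' b.csv a.csv > c.csv

cat c.csv
# orange
# pear
```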
Inian

Assuming your real data is sorted on the columns you care about like your sample data is:

$ comm -23 <(cut -f2 a.tsv) <(cut -f1 b.tsv)
orange
pear

This uses comm to print out the entries in the first file that aren't in the second one, after using cut to get just the columns you care about.

If not already sorted:

comm -23 <(cut -f2 a.tsv | sort) <(cut -f1 b.tsv | sort)
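For reference, comm produces three columns (1: lines unique to the first input, 2: lines unique to the second, 3: common lines), and each numeric flag suppresses one of them. A small sketch on the pre-cut, pre-sorted columns of the sample data (file names here are just for illustration):

```shell
# The second column of a.tsv and first column of b.tsv, both already sorted.
printf 'apple\nbanana\norange\npear\n' > col_a.txt
printf 'apple\nbanana\ncherry\n' > col_b.txt

# -2 drops lines unique to col_b.txt, -3 drops common lines,
# leaving only lines unique to col_a.txt.
comm -23 col_a.txt col_b.txt
# orange
# pear
```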
Shawn

If you want to use Miller (https://github.com/johnkerl/miller), a clean and easy tool, the command could be

mlr --nidx --fs "\t" join --ul --np -j join -l 2 -r 1 -f a.csv then cut -f 2 b.csv

It gives you

orange
pear

It's a join that does not emit paired records (--np) and emits unpaired records from the left file (--ul).
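If Miller isn't installed, a similar unpaired-left join can be sketched with GNU coreutils join, an alternative added here for comparison (not part of the original answer). join requires both inputs sorted on the join field:

```shell
# Recreate the sample files from the question.
printf '1\tapple\n2\tbanana\n3\torange\n4\tpear\n' > a.csv
printf 'apple\t0.89\nbanana\t0.57\ncherry\t0.34\n' > b.csv

TAB="$(printf '\t')"
sort -t "$TAB" -k2 a.csv > a.sorted   # sort a.csv on its 2nd field
sort b.csv > b.sorted                 # b.csv joins on its 1st field

# -v 1 prints lines of the first input with no match in the second;
# -1 2 / -2 1 pick the join fields. cut keeps just the fruit name.
join -t "$TAB" -v 1 -1 2 -2 1 a.sorted b.sorted | cut -f 1
# orange
# pear
```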

aborruso