Joining two different datasets using multiple keyvalues

Question

I have two sets of data.

The first dataset looks like:

Storm_ID,Cell_ID,Wind_speed
2,10236258,27
2,10236300,58
2,10236301,25
3,10240400,51

The second dataset looks like:

Storm_ID,Cell_ID,Storm_surge
2,10236299,0.27
2,10236300,0.27
2,10236301,0.35
2,10240400,0.35
2,10240401,0.81
4,10240402,0.11

Now I want an output which looks something like this:

Storm_ID,Cell_ID,Wind_speed,Storm_surge
2,10236258,27,0
2,10236299,0,0.27
2,10236300,58,0.27
2,10236301,25,0.35
2,10240400,0,0.35
2,10240401,0,0.81
3,10240400,51,0
4,10240402,0,0.11

I tried join command in Linux to perform this task and failed badly. Join command skipped the rows which didn't match in the database. I can use Matlab but the size of the data is more than 100 GB which is making it very difficult for this task. Can someone please guide me on this one please. Can I use SQL or python to complete this task. I appreciate your help Thanks.

@Gordon The small set of database has been represented in the above example. Sorry, if this is not the right way to do it. I am new to stack overflow — Sami, Mar 30 '18 at 02:07
@GordonLinoff I'm not sure if he wants to use a database at all. It looks like he has a CSV file and wants to use the UNIX `join` command https://shapeshed.com/unix-join/ To me it looks like an interesting question, but he still needs to explain a bit more what he means with `skipped the rows which didn't match in the database`. As if he has the data in SQL and in CSV, and when he compares a SQL JOIN to the CSV UNIX join, the results differ. We can't know why from the info given. — Daniel F, Mar 30 '18 at 15:43

score 1 · Answer 1 · answered Mar 30 '18 at 01:48

1

I think you want a full outer join:

select storm_id, cell_id,
       coalesce(d1.wind_speed, 0) as wind_speed,
       coalesce(d2.storm_surge, 0) as storm_surge
from dataset1 d1 full join
     dataset2 d2
     using (storm_id, cell_id);

answered Mar 30 '18 at 01:48

Gordon Linoff

1,242,037
58
646
786

knb · Answer 2 · 2018-03-30T16:40:07.750

Shell-Only Solution

Make a backup of your files first

Assuming your files are called wind1.txt and wind2.txt

You could apply these sets of shell commands:

perl -pi -E "s/,/_/" wind*
perl -pi -E 's/(.$)/$1,0/' wind1.txt
perl -pi -E "s/,/,0,/"     wind2.txt
join --header -a 1 -a 2 wind1.txt wind2.txt > outfile.txt

Intermediate Result

Storm_ID_Cell_ID,Wind_speed,0
2_10236258,27,0
2_10236299,0,0.27
2_10236300,0,0.27
2_10236300,58,0
2_10236301,0,0.35
2_10236301,25,0
2_10240400,0,0.35
2_10240401,0,0.81
3_10240400,51,0
4_10240402,0,0.11

Now rename in line 0 to "storm_surge", replace first _ with "," in digits

perl -pi -E "s/Wind_speed,0/Wind_speed,Storm_surge/" outfile.txt
perl -pi -E 's/^(\d+)_/$1,/' outfile.txt
perl -pi -E "s/Storm_ID_Cell_ID/Storm_ID,Cell_ID/" outfile.txt

Intermediate result:

Storm_ID,Cell_ID,Wind_speed,Storm_surge
2,10236258,27,0
2,10236299,0,0.27
2,10236300,0,0.27
2,10236300,58,0
2,10236301,0,0.35
2,10236301,25,0
2,10240400,0,0.35
2,10240401,0,0.81
3,10240400,51,0
4,10240402,0,0.11

Finally run this:

awk 'BEGIN { FS=OFS=SUBSEP=","}{arr[$1,$2]+=$3+$4 }END {for (i in arr) print i,arr[i]}' outfile.txt  | sort

(Sorry - Q was closed while answering)

score 0 · Answer 3 · answered Mar 31 '18 at 16:31

0

awk -F, -v OFS=, '{x = $1 "," $2} FNR == NR {a[x] = $3; b[x] = 0; next} {b[x] = $3} !a[x] {a[x] = 0} END {for (i in a) print i, a[i], b[i]}' f1 f2 | sort -n

Since it's a loop, awk produces random order. Hence sorting at the end.

answered Mar 31 '18 at 16:31

Sami

29
1
5

Joining two different datasets using multiple keyvalues

3 Answers3