
I have the following:

file1.csv

"Id","clientName1","clientName2"

file2.csv

"Id","Name1","Name2"

I want to read file1 sequentially. For each record, I want to check if there is a matching Id in file2. There may be more than one match. For each match, I want to append Name1, Name2 to the end of the record in file1.csv.

So, possible result, if a record has more than one match in file2:

"Id","clientName1","clientName2","Name1","Name2","Name1","Name2"
Ken
  • Does it have to be bash? I would suggest a more sophisticated language that allows you to open a file as a database and then just do a `LEFT JOIN` on the two, or something like that. Or even just use perl, slurp them both into hashes, and do some quick manipulation in perl. – Mort Oct 26 '17 at 02:15
  • I'm not familiar with those things, however. I need to do this as quickly and easily as possible. Preferably on the command line or in a bash script, but I do know Java, PHP, etc. if it's easy to do it with those languages. – Ken Oct 26 '17 at 11:42
  • Sorry, I need to clarify that file1 may have duplicate Ids; it also has many more fields than shown, but file2 has just those three fields. – Ken Oct 26 '17 at 14:59

4 Answers


I'm afraid bash may not be the most efficient solution, but the following bash script would work:

#!/bin/bash
# requires bash 4+ for associative arrays

declare -A id_hash

# collect every field after the Id, keyed by Id, from file1
while IFS= read -r line; do
    id=$(echo "$line" | cut -d ',' -f 1)
    name=$(echo "$line" | cut -d ',' -f 2-)
    if [ -z "${id_hash[$id]}" ]; then
        id_hash[$id]=$name
    else
        id_hash[$id]=${id_hash[$id]},$name
    fi
done < file1.csv

# append the fields from file2 to the same Ids
while IFS= read -r line; do
    id=$(echo "$line" | cut -d ',' -f 1)
    name=$(echo "$line" | cut -d ',' -f 2-)
    if [ -z "${id_hash[$id]}" ]; then
        id_hash[$id]=$name
    else
        id_hash[$id]=${id_hash[$id]},$name
    fi
done < file2.csv

# print each Id followed by everything collected for it
for id in "${!id_hash[@]}"; do
    echo "$id,${id_hash[$id]}"
done
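
Saved as, say, merge.sh (the file name is only an example), the script can be run and its output redirected into the merged file:

bash merge.sh > merged.csv

Note that the header rows of the two CSV files are treated like any other record keyed by "Id", so they too end up merged onto a single output line.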
tshiono
  • Sorry, I need to clarify that file1 may have duplicate Ids; it also has many more fields than shown, but file2 has just those three fields. Would this still work as shown? – Ken Oct 26 '17 at 15:02
  • In case file1 has duplicate IDs, add `if .. else .. fi` lines as shown in the while loop for file2. I've modified the code above accordingly. Regarding the number of fields, it accepts an arbitrary count of fields, which is handled by the `cut -f 2-` part. – tshiono Oct 27 '17 at 01:24

In response to the OP's clarification in the comments, here is a revised version of the single awk command. It merges records when there are duplicate IDs in file1, in file2, or in both, even with differing numbers of fields. (The old version works for the question as originally stated.)

awk -F',' '{one=$1;$1="";a[one]=a[one]$0} END{for (i in a) print i""a[i]}' OFS=, file[12]

For the inputs:

file1

"Id1","clientN1","clientN2"
"Id2","Name3","Name4"
"Id3","client00","client01","client02"
"Id1","client1","client2","client3"

file2

"Id1","Name1","Name2"
"Id1","Name3","Name4"
"Id2","Name0","Name1"
"Id2","Name00","Name11","Name22"

The output is merged file1 and file2 on same IDs:

"Id1","clientN1","clientN2","client1","client2","client3","Name1","Name2","Name3","Name4"
"Id2","Name3","Name4","Name0","Name1","Name00","Name11","Name22"
"Id3","client00","client01","client02"
αғsнιη
  • Yours is the first answer here that I'm capable of, so I'm going to try it first. It's exactly what I want if it works. I'll let you know. – Ken Oct 26 '17 at 11:51
  • Actually, I need a little clarification. Is the letter 'a' to be replaced with the file names? Can you tell me which variables are to be replaced with what? Thx. – Ken Oct 26 '17 at 12:12
  • Sorry, I also need to clarify that file1 may have duplicate Ids; it also has many more fields than shown, but file2 has just those three fields. – Ken Oct 26 '17 at 15:01
  • @Ken see my update. About `a`: no, it's an array named `a`. The files are `file[12]`, which expands to `file1 file2` (just shortened). – αғsнιη Nov 01 '17 at 12:47

A regex solution using join and GNU sed:

join -t , -a 1 file[12].csv | sed -r '$!N;/^(.*,)(.*)\n\1/!P;s//\n\1\2,/;D'

This assumes that both file1.csv and file2.csv are sorted by id and have no header.

file1.csv

1,c11,c12
2,c21,c22
3,c31,c32

file2.csv

1,n11,n12
1,n21,n22
1,n31,n32
2,n41,n42

gives a result of

1,c11,c12,n11,n12,n21,n22,n31,n32
2,c21,c22,n41,n42
3,c31,c32
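
If the real files still carry a header row or are not yet sorted, a small preparation step along these lines (a sketch; the .sorted.csv names are just placeholders and GNU coreutils are assumed) brings them into the shape the command above expects:

# strip the single header line from each file and sort on the first (Id) field
tail -n +2 file1.csv | sort -t , -k 1,1 > file1.sorted.csv
tail -n +2 file2.csv | sort -t , -k 1,1 > file2.sorted.csv
join -t , -a 1 file1.sorted.csv file2.sorted.csv | sed -r '$!N;/^(.*,)(.*)\n\1/!P;s//\n\1\2,/;D'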

UPDATE

In cases where file1.csv might contain duplicate ids and varying field lengths, I would suggest a pre-processing step to make sure file1.csv is clean before joining with file2.csv:

awk -F, '{for(i=2;i<=NF;i++) print $1 FS $i}' file1.csv |\
    sort -u |\
    sed -r '$!N;/^(.*,)(.*)\n\1/!P;s//\n\1\2,/;D'
  • the first awk process splits all data into (id, name) pairs
  • sort -u sorts the pairs and removes duplicates
  • the last sed process merges all pairs with the same id into a single row

input

1,c11,c12
1,c12,c14,c13
1,c15,c12
2,c21,c22

output

1,c11,c12,c13,c14,c15
2,c21,c22
etopylight
  • I'm a bit confused about file[12].csv. I assume that's to be replaced with a file name. Can you tell me exactly which variables need to be replaced with what? Thx. – Ken Oct 26 '17 at 13:12
  • @Ken `file[12].csv` is basically equal to `file1.csv file2.csv`; it is just a cleaner way to represent similar files with the same prefix or suffix – etopylight Oct 26 '17 at 13:23
  • Thx but I was just simplifying it when I wrote that. They are two completely different file names. How would I code your line in that case? Or, I can try this anyway but changing the file names temporarily. – Ken Oct 26 '17 at 13:30
  • @Ken In that case, you just need to replace `file[12].csv` with, say, `clientname.csv name.csv`, so eventually the full command would look like `join -t , -a 1 clientname.csv name.csv | sed -E '$!N;/^(.*,)(.*)\n\1/!P;s//\n\1\2,/;D'`. Those filenames are just placeholders. – etopylight Oct 26 '17 at 13:37
  • I just ran this and got error: sed: invalid option -- E. I ran it on the command line. Do you know what needs to be adjusted? – Ken Oct 26 '17 at 13:45
  • @Ken what version of sed or operating system are you currently using? – etopylight Oct 26 '17 at 13:50
  • If option `-E` is not available, you can try to replace the sed part with `sed '$!N;/^\(.*,\)\(.*\)\n\1/!P;s//\n\1\2,/;D'` – etopylight Oct 26 '17 at 13:53
  • On Linux. I just typed in uname -r and got 2.6.18-348.3.1.el5xen – Ken Oct 26 '17 at 13:55
  • @Ken Ok, it seems to be my fault; the option to enable extended regexp should be `-r` in `GNU sed`, which I mistook for `-E` in `BSD sed` (answer edited) – etopylight Oct 26 '17 at 14:06
  • Would your code need to be different if file1 contains duplicate Id's? I ran it and that may be causing a problem. – Ken Oct 26 '17 at 14:45
  • @Ken By saying duplicate ids, do you mean that file1 might contain rows with the same id but different client names, for example `id1, cn1, cn2` and `id1, cn1, cn3`, or do you mean plainly duplicate rows? – etopylight Oct 26 '17 at 15:14
  • File1 has many more fields than shown also, but file2 is just those three fields. – Ken Oct 26 '17 at 15:16
  • They're not duplicate rows but they may start with the same Id and names. As mentioned, I neglected to say that a file1 record actually contains many more fields. My bad if that matters. Sorry. – Ken Oct 26 '17 at 15:19
  • @Ken No problem :) just want to make sure that I understand the requirements. Answer updated for the case you mentioned. – etopylight Oct 26 '17 at 18:06
  • To etopylight... My supervisor told me I could do it with an outer loop reading file1 and a nested loop reading file2, and comparing ids. I didn't realize that I could write logic like that in a bash script, just like programming. So it's done. However, I applaud you for trying to help and sticking with me through over a dozen comments. Thanks a million. Just tried to vote for your answer but it wouldn't record. – Ken Oct 26 '17 at 20:44
  • @Ken Nice to know that you have solved it on your own! By the way, it would be great if you could also post your bash solution; it might be quite helpful to those who encounter the same problem in the future. – etopylight Oct 27 '17 at 03:02

Thanks to all, but this has been completed. The code I wrote is below:

#!/bin/bash

echo
echo 'Merging files into one'

IFS=","
while read -r id lname fname dnaid status type program startdt enddt ref email dob age add1 add2 city postal phone1 phone2
do
  # rebuild the file1 record in the desired column order
  var="$dnaid,$lname,$fname,$status,$type,$program,$startdt,$enddt,$ref,$email,$dob,$age,$add1,$add2,$city,$postal,$phone1,$phone2"

  # scan file2 for every record with a matching Id and append its names
  while read -r id2 cwlname cwfname
  do
       if [ "$id" = "$id2" ]
       then
           var="$var,$cwlname,$cwfname"
       fi

  done < file2.csv

  echo "$var" >> /root/scijoinedfile.csv   # appends; remove the file before re-running

done < file1.csv

echo
echo "Merging completed"
Ken