0

I have two .csv files and I would like to query column 3 from File 1 against column 3 from File 2. The output file should consist of column 3 from File 1, and if the entry exists in File 2, print the entire corresponding row from File 2. If no match in File 2, print File 1 column 3 and leave rest blank (see below).

File 1:

...   ...   a   ...
...   ...   e   ...
...   ...   b   ...
...   ...   c   ...

File 2:

...   ...   a   a-info-1   a-info-2   a-info-n
...   ...   c   c-info-1   c-info-2   c-info-n
...   ...   d   d-info-1   d-info-2   d-info-n
...   ...   e   e-info-1   e-info-2   e-info-n
...   ...   f   f-info-1   f-info-2   f-info-n

Desired output:

a   ...   ...   a-info-1   a-info-2   a-info-n
e   ...   ...   e-info-1   e-info-2   e-info-n
b   
c   ...   ...   c-info-1   c-info-2   c-info-n

I have tried to accomplish this in both R and bash. I thought I should be able to figure this out by referencing this thread: awk compare 2 files, 2 fields different order in the file, print or merge match and non match lines, but I am very much new to all things programming and can't seem to figure out how to transcribe the solution to my case.

My best try:

awk -F"," 'NR==FNR{a[$3]=$3;next}{if ($3 in a)print a[$3]","$0;}' file1.csv file2.csv  > output.csv

The problem with this code is that it does not print the entries from File 1 that do not have entries in File 2.

If your solutions could be overly explanatory, I would very much appreciate it!

Luc O
  • 3
  • 2

1 Answers1

0

From your example, I understand that you want to lookup the rows in second file by the values in the third column of first file.

Assuming your data frames are named df1 and df2, let df1 and df2 be

> df1
  col1 col2 col3
1   48   39    a
2   26   42    e
3   45   30    b
4   40   46    c

> df2
  col1 col2 col3   col4    col5
1   49   47    a info a info2 a
2   27   45    b info b info2 b
3   28   37    c info c info2 c
4   37   26    e info e info2 e
5   25   40    f info f info2 f

The first step is to identify the rows in the second file,

rn <-  match(df1$col3, df2$col3)

This will create the vector of the row names, based on which the file 2 can be filtered

out <- cbind(df1[3], df2[rn,][-3])

Here, the column from first data frame is combined with the all the columns except third of the looked-up rows of second data frame.

The new data frame will have the row names from the vector which we had created earlier. Also, if there are some unidentified rows it will create NA in the data frame which can be replaced with an empty character.

rownames(out) <- rownames(df1)
out[is.na(out)] <- ""

The output will be like this:

> out
  col3 col1 col2   col4    col5
1    a   49   47 info a info2 a
2    e   37   26 info e info2 e
3    b                         
4    c   28   37 info c info2 c
Mohanasundaram
  • 2,889
  • 1
  • 8
  • 18