
I have 2 CSV files:

file_1 columns: id,user_id,message_id,rate
file_2 columns: id,type,timestamp

The relation between the files is that file_1.message_id = file_2.id.

I want to create a 3rd file that will have the following columns:

file_1.id,file_1.user_id,file_1.message_id,file_1.rate,file_2.timestamp

Any ideas on how to do this in Linux?

– Ran

4 Answers


You can use the join command like this:

join -t, -1 3 -2 1 -o 1.1,1.2,1.3,1.4,2.3 <(sort -t, -k 3,3 file1) <(sort file2)

It first sorts the two files (file1 by its 3rd field, file2 by its 1st) and then joins them on the 3rd field of file1 and the 1st field of file2. The -o option selects which fields end up in the output, in order.
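
For example, with one matching row in each file (sample data assumed here; if your real files begin with a header row, strip it first, e.g. with tail -n +2):

$ cat file1
1,3334,424,44
$ cat file2
424,rr,22222
$ join -t, -1 3 -2 1 -o 1.1,1.2,1.3,1.4,2.3 <(sort -t, -k 3,3 file1) <(sort file2)
1,3334,424,44,22222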

– dogbane
  • I'm getting: `join: file 2 is not in sorted order` `join: file 1 is not in sorted order` – Ran Jan 11 '12 at 18:01
  • even when you run it through the sort command as shown in dogbane's solution? (`sort <file>`) Only the `<file>` is changed; include the sort! – David Chan Jan 12 '12 at 05:31
  • there's probably a bug here... afaik join only works if the files are sorted by the join key. since you're calling sort on file1, it will be sorted by column 1 which is not the sort key – Pradeep Gollakota Oct 08 '14 at 18:27
  • this should probably be changed to the following command... `join -t -1 3 -2 1 -o 1.1 1.2 1.3 1.4 2.3 <(sort -t $',' -k 3,3 file1) <(sort file2)` in order to sort by the correct join key – Pradeep Gollakota Oct 08 '14 at 18:30
  • @PradeepGollakota almost. You forgot the comma in `join -t, ...`. Also, I think `sort -t $','` and `sort -t,` are equivalent, but the latter variant is shorter. – myrdd Apr 23 '18 at 08:08

Seems to be a job for SQLite. Using the SQLite shell:

 create table f1(id,user_id,message_id,rate);
 create table f2(id,type,timestamp);

 .separator ,
 .import 'file_1.txt' f1
 .import 'file_2.txt' f2

 CREATE INDEX i1 ON f1(message_id ASC); -- optional
 CREATE INDEX i2 ON f2(id ASC);         -- optional

 .output 'output.txt'
 .separator ,

 SELECT f1.id, f1.user_id, f1.message_id, f1.rate, f2.timestamp
   FROM f1
   JOIN f2 ON f2.id = f1.message_id;

 .output stdout
 .q

Note that if even a single line has the wrong number of commas, the import stage will fail. You can stop the rest of the script from running after such an error by putting `.bail on` at the beginning of the script.
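
To run the whole thing non-interactively, save the commands above to a file (the name join.sql is only an assumption here) with `.bail on` as its first line, and feed it to the shell:

 sqlite3 joined.db < join.sql

Pointing sqlite3 at a database file such as joined.db, rather than the default transient in-memory database, also helps with very large inputs, as noted in the comments below.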

If you want unmatched ids you can try:

SELECT f1.* FROM f1 LEFT JOIN f2 ON f2.id = f1.message_id WHERE f2.id IS NULL;

This selects every row from f1 for which no corresponding row in f2 was found.

– Benoit
  • I hesitated to initially comment "use a proper database," but this seems to do the job with the database as a temporary mechanism. +1 – Dan Fego Jan 11 '12 at 14:47
  • Each file has about 70M records; I think it will be "hard" for the db – Ran Jan 11 '12 at 14:51
  • @Ran: absolutely not if you create indexes. `CREATE INDEX i1 ON f1(message_id); CREATE INDEX i2 ON f2(id);` You can create indexes after the `.import` stage. If those column values are unique you can declare them as `PRIMARY KEY` in the create table statements, then no need to use indexes. Use `.bail on` to exit at the slightest error. – Benoit Jan 11 '12 at 14:52
  • @Ran: but note that if you have lots of data it's probably more efficient to use an intermediate database file. Run `sqlite foo.db` instead of `sqlite`. – Benoit Jan 11 '12 at 15:06

With awk you can try something like this:

awk -F, 'NR==FNR{a[$3]=$0;next} ($1 in a){print a[$1]","$3 > "file_3"}' file_1 file_2

Test:

[jaypal:~/Temp] cat file_1     # Contents of File_1
id,user_id,message_id,rate
1,3334,424,44

[jaypal:~/Temp] cat file_2     # Contents of File_2
id,type,timestamp
424,rr,22222

[jaypal:~/Temp] awk -F, 'NR==FNR{a[$3]=$0;next} ($1 in a){print a[$1]","$3 > "file_3"}' file_1 file_2

[jaypal:~/Temp] cat file_3     # Contents of File_3 made by the script
1,3334,424,44,22222
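
The one-liner is easier to follow spread over several lines; this is the same program as above, just with comments:

awk -F, '
    # First pass (NR==FNR, i.e. file_1): index each line by its
    # 3rd field, the message_id.
    NR == FNR { a[$3] = $0; next }
    # Second pass (file_2): if this id was saved as a message_id,
    # append the timestamp ($3) to the stored file_1 line.
    ($1 in a) { print a[$1] "," $3 > "file_3" }
' file_1 file_2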
– jaypal singh
  • I don't need the "file_1." prefix in the output file; it was just to show which file each column should come from – Ran Jan 11 '12 at 15:31
  • OK, thank god, because that was getting a little messy. I have updated the answer. Hope that helps! :) – jaypal singh Jan 11 '12 at 15:35
  • @Ran I did a test with some sample data and it looks like it works. Let me know if you run into any issues or if you don't understand anything, and I will walk you through it. – jaypal singh Jan 11 '12 at 20:33
  • Is awk the right solution if each of the files has ~80M records? – Ran Jan 12 '12 at 09:10
  • It basically boils down to how much time you want to invest. A good solution would be to write an efficient program in a compiled language rather than an interpreted one. You can give it a try and see the performance. With faster machines, the performance of an interpreted/scripting language isn't that bad. – jaypal singh Jan 12 '12 at 09:25
  • And how about using `join`? Any ideas how to do it there? – Ran Jan 12 '12 at 10:53

You can try this:
1. Change all lines to start with the key:

awk -F',' '{ print $3 " file1 " $1 " " $2 " " $4 }' < file1 >  temp
awk -F',' '{ print $1 " file2 " $2 " " $3 }'        < file2 >> temp

Now the lines look like:

message_id file1 id user_id rate
id file2 type timestamp
2. Sort temp by the first two columns. Now related lines are adjacent, with the file1 line first:

    sort -k 1,1 -k 2,2 < temp > temp2

3. Run awk over temp2 to read the lines: on file1 lines, save the fields; on file2 lines, print them. A sketch of this step follows below.
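
A minimal sketch of that last step (assuming the exact field layout produced in step 1, so that after sorting each file1 line immediately precedes its matching file2 lines):

awk '
    # file1 lines look like: message_id file1 id user_id rate
    $2 == "file1" { key = $1; id = $3; user = $4; rate = $5; next }
    # file2 lines look like: id file2 type timestamp
    $2 == "file2" && $1 == key { print id "," user "," key "," rate "," $4 }
' temp2 > file_3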

– ugoren