I have a 'master' file with a number of columns: 1 2 3 4 5. I have a few other files, with fewer rows than the master file, each with columns: 1 6. I'd like to merge these files matching on the column 1 field and add column 6 to the master. I've seen some python/UNIX solutions but would prefer to use ruby/fastercsv if it's a good fit. I would appreciate any help getting started.
-
Are the files sorted by column 1? Are they big? Can you fit other files in memory? – Mladen Jablanović Oct 30 '11 at 19:05
-
Files can be sorted by column one but the files I'm matching to the master have a subset of rows. Files can all fit in memory, master file at the largest is 2000 rows, 20 columns. Smaller files are < 270 rows. Thanks for the question... – chuckfinley Oct 30 '11 at 20:15
3 Answers
FasterCSV is now the default CSV implementation in Ruby 1.9. This code is untested, but should work.
require 'csv'
master = CSV.read('master.csv') # Reads in master
master.each {|each| each.push('')} # Adds another column to all rows
Dir.glob('*.csv').each do |each| #Goes thru all csv files
next if each == 'master.csv' # skips the master csv file
file = CSV.read(each) # Reads in each one
file.each do |line| #Goes thru each line of the file
temp = master.assoc(line[0]) # Finds the appropriate line in master
temp[-1] = line[1] if temp #updates last column if line is found
end
end
csv = CSV.open('output.csv','wb') #opens output csv file for writing
master.each {|each| csv << each} #Goes thru modified master and saves it to file

- 4,773
- 1
- 25
- 52
-
Thanks Kassym. Getting an error I'm trying to troubleshoot (ruby 1.8.7):`combine.rb:8: syntax error, unexpected kIN in = master.assoc(line[0]) # Finds the appropriate line in master` – chuckfinley Nov 02 '11 at 14:06
-
If you are using ruby 1.8.7 make sure that you change CSV to FasterCSV. I've also updated my post, made a stupid stupid mistake... – Kassym Dorsel Nov 02 '11 at 14:32
-
Not pointing any fingers here. Still getting an error: combine.rb:15: undefined local variable or method `csv' for main:Object (NameError). Not enough room here to show all the code but I changed all instances of CSV to FasterCSV with the exception of CSV.open -- made that FCSV.open, was that right? Also changed require 'fastercsv'. Thanks again, hope this isn't the albatross code... – chuckfinley Nov 02 '11 at 23:37
-
Sorry about that. As is, this code runs on 1.8.7 and 1.9.3 without any problems, I've fully tested it now. – Kassym Dorsel Nov 03 '11 at 03:09
-
Kassym, problem was on my side. Was getting `CSV::IllegalFormatError` , turned out to be my Mac CRs, changed to LF and all is good. Thank you so much for your time & help. – chuckfinley Nov 03 '11 at 15:59
$ cat j4.csv
how, now, brown, cow, f1
now, is, the, time, f2
one, two, three, four, five
xhow, now, brown, cow, f1
xnow, is, the, time, f2
xone, two, three, four, five
$ cat j4a.csv
how, b
one, d
$ cat hj.rb
require 'pp'
require 'rubygems'
require 'fastercsv'
pp(
FasterCSV.read('j4a.csv').inject(
FasterCSV.read('j4.csv').inject({}) do |m, e|
m[e[0]] = e
m
end) do |m, e|
k = e[0]
m[k] << e.last if m[k]
m
end.values)
$ ruby hj.rb
[["now", " is", " the", " time", " f2"],
["xhow", " now", " brown", " cow", " f1"],
["xone", " two", " three", " four", " five"],
["how", " now", " brown", " cow", " f1", " b"],
["one", " two", " three", " four", " five", " d"],
["xnow", " is", " the", " time", " f2"]]
This works by mapping your master file into a hash with column one as the key, and then it just looks up the key from your other files. As written the code appends the last column when the keys match. Since you have more than one non-master file, you could adapt the concept by replacing FasterCSV.read('j4a.csv')
with a method that reads each file and concatenates them all into a single array of arrays, or you could just save the result from the inner inject
(the master hash) and apply each other file to it in a loop.

- 143,651
- 25
- 248
- 329
-
Thanks DigitalRoss, this is very useful code, it took me a while to walk-through & see what was hapening... – chuckfinley Nov 03 '11 at 16:14
temp = master.assoc(line[0])
The above is a super slow process. The whole complex is at least O(n^2).
I would use the below process:
- for 1 6 csv, convert it to a big hash with 1 as key and 6 as value, named as 1_to_6_hash
- loop 1 2 3 4 5 csv row by row, set row[6] = 1_to_6_hash[row[1]]
It will dramatically reduce complex to O(n)

- 3,311
- 5
- 28
- 39