
I have a CSV file that I want to read with Ruby and create Ruby objects to insert into a MySQL database with Active Record. What's the best way to do this? I see two clear options: FasterCSV & the Ruby core CSV. Which is better? Is there a better option that I'm missing?

EDIT: Gareth says to use FasterCSV, so what's the best way to read a CSV file using FasterCSV? Looking at the documentation, I see methods called parse, foreach, read, open... It says that foreach "is intended as the primary interface for reading CSV files." So, I guess I should use that one?

the Tin Man
ma11hew28

2 Answers


Ruby 1.9 adopted FasterCSV as its core CSV processor, so I would say it's definitely better to go with FasterCSV, even if you're still using Ruby 1.8.

Gareth
  • What's the option for `foreach` to skip the first row (header)? – ma11hew28 Dec 06 '10 at 18:09
  • The documentation (http://ruby-doc.org/ruby-1.9/classes/CSV.html#M001337) says that the options for foreach are the same as the options for CSV.new - http://ruby-doc.org/ruby-1.9/classes/CSV.html#M001346 – Gareth Dec 06 '10 at 18:12
  • 1
    I understood that. I was getting tripped up cause I was trying `:headers => false`. I think I got it now. `:headers => true` is what I wanted. It treats the first row as the header row and starts with the second row. Haha. I was thinking about it in the opposite way. – ma11hew28 Dec 06 '10 at 18:16
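To tie the comments together, here is a minimal, self-contained sketch of reading with `:headers => true` using the Ruby 1.9 `CSV` API (which is FasterCSV). The file name and columns are made up for illustration:

```ruby
require "csv"

# Write a small sample file so the example is self-contained.
File.write("people.csv", "name,age\nAlice,30\nBob,25\n")

# :headers => true treats the first line as the header row, so iteration
# starts with the first data row and each row can be indexed by header name.
rows = []
CSV.foreach("people.csv", :headers => true) do |row|
  rows << { :name => row["name"], :age => row["age"].to_i }
end

rows.each { |r| puts "#{r[:name]} is #{r[:age]}" }
```

With `:headers => false` (the default) the header line would come through as an ordinary data row, which is the behavior discussed above.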

If you have a lot of records to import, you might want to use MySQL's bulk loader, LOAD DATA INFILE. It's extremely fast.

LOAD DATA INFILE can be used to read files obtained from external sources. For example, many programs can export data in comma-separated values (CSV) format, such that lines have fields separated by commas and enclosed within double quotation marks, with an initial line of column names. If the lines in such a file are terminated by carriage return/newline pairs, the statement shown here illustrates the field- and line-handling options you would use to load the file:

LOAD DATA INFILE 'data.txt' INTO TABLE tbl_name
  FIELDS TERMINATED BY ',' ENCLOSED BY '"'
  LINES TERMINATED BY '\r\n'
  IGNORE 1 LINES;

If the input values are not necessarily enclosed within quotation marks, use OPTIONALLY before the ENCLOSED BY keywords.

Use that to pull everything into a temporary table. From there you have options: use ActiveRecord to delete the records you don't want, copy the rest into your production table, and drop or truncate the temp; or use ActiveRecord to search the temporary table and copy the matching records to production before dropping or truncating it. You might even be able to do the table-to-table copy, or append one table to another, entirely inside MySQL.
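As a sketch of that staging flow, entirely in MySQL (the table name `products_staging`, the production table `products`, and the `price` filter are all hypothetical):

```sql
-- Stage the raw CSV in a temporary table shaped like production.
CREATE TEMPORARY TABLE products_staging LIKE products;

LOAD DATA INFILE 'data.txt' INTO TABLE products_staging
  FIELDS TERMINATED BY ',' ENCLOSED BY '"'
  LINES TERMINATED BY '\r\n'
  IGNORE 1 LINES;

-- Weed out records you don't want, then copy table-to-table.
DELETE FROM products_staging WHERE price IS NULL;
INSERT INTO products SELECT * FROM products_staging;
DROP TEMPORARY TABLE products_staging;
```

Because the filtering and the copy both happen server-side, no rows ever round-trip through Ruby.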

It's going to be tough to beat the speed of a dedicated loader combined with the database's own query mechanism for processing records in bulk. Turning each CSV record into an object and then writing it through the ORM adds a lot of overhead, so unless you have some super difficult validations that genuinely require Ruby logic, you'll be faster going straight to the database.


EDIT: Here's a simple example of mapping CSV headers to DB columns:

require "csv"

data = <<EOT
header1, header2, header 3
1, 2, 3
2, 2, 3
3, 2, 3
EOT

header_to_table_columns = {
  'header1'  => 'col1',
  'header2'  => 'col2',
  'header 3' => 'col3'
}

arr_of_arrs = CSV.parse(data)
headers = arr_of_arrs.shift.map{ |i| i.strip }
db_cols = header_to_table_columns.values_at(*headers)
arr_of_arrs.each do |ary|
  # insert into the database using an ORM or by creating insert statements
end
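One way to finish that loop body is to zip the mapped column names with each row's values, giving a hash that's ready to hand to an ORM. This sketch repeats the setup above so it stands alone; the value `strip` calls are an addition to drop the stray whitespace in the sample data:

```ruby
require "csv"

data = <<EOT
header1, header2, header 3
1, 2, 3
2, 2, 3
EOT

header_to_table_columns = {
  'header1'  => 'col1',
  'header2'  => 'col2',
  'header 3' => 'col3'
}

arr_of_arrs = CSV.parse(data)
headers = arr_of_arrs.shift.map { |i| i.strip }
db_cols = header_to_table_columns.values_at(*headers)

records = arr_of_arrs.map do |ary|
  # Pair each DB column name with its (whitespace-stripped) value.
  Hash[db_cols.zip(ary.map { |v| v.strip })]
end
# records.first => { "col1" => "1", "col2" => "2", "col3" => "3" }
```

From there, something like `Model.create(records)` would insert the lot through ActiveRecord, or you could build INSERT statements from the hashes directly.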
the Tin Man