
I want to parse two CSV files of the MaxMind GeoIP2 database, join them on a column, and merge the result into one output file.

I used the standard Ruby CSV library, but it is very slow. I think it tries to load the whole file into memory.

require 'csv'

# Both files are read fully into memory and parsed as single strings.
block_file    = File.read(block_path)
block_csv     = CSV.parse(block_file, :headers => true)
location_file = File.read(location_path)
location_csv  = CSV.parse(location_file, :headers => true)

CSV.open(output_path, "wb",
         :write_headers => true,
         :headers => ["geoname_id", "Y", "Z"]) do |csv|

  block_csv.each do |block_row|
    puts block_row['geoname_id']

    location_csv.each do |location_row|
      if block_row['geoname_id'] == location_row['geoname_id']
        puts " match :"
        csv << [block_row['geoname_id'], block_row['Y'], block_row['Z']]
        break
      end
    end
  end
end

Is there another Ruby library that supports processing in chunks?

block_csv is 800MB and location_csv is 100MB.


1 Answer


Just use CSV.open(block_path, 'r', :headers => true).each do |line| instead of File.read and CSV.parse. It will parse the file line by line.

In your current version, you explicitly tell it to read the whole file with File.read and then to parse that entire string with CSV.parse. So it does exactly what you told it to do.
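As an illustration, here is a minimal sketch of how the join from the question could look with that change, assuming the same column names ("geoname_id", "Y", "Z") and path variables as in the question. Loading the smaller location file into a Hash keyed by geoname_id is my own addition to avoid re-scanning it for every block row; it is not something CSV.open requires.

require 'csv'

# Build a lookup table from the smaller (100 MB) location file.
# Each row is parsed as it is read; only the Hash stays in memory.
locations = {}
CSV.open(location_path, 'r', :headers => true).each do |location_row|
  locations[location_row['geoname_id']] = location_row
end

CSV.open(output_path, 'wb',
         :write_headers => true,
         :headers => ['geoname_id', 'Y', 'Z']) do |csv|
  # The 800 MB block file is never loaded as a whole;
  # CSV.open(...).each yields one parsed row at a time.
  CSV.open(block_path, 'r', :headers => true).each do |block_row|
    if locations.key?(block_row['geoname_id'])
      csv << [block_row['geoname_id'], block_row['Y'], block_row['Z']]
    end
  end
end

The Hash lookup also turns the nested loop into a single pass over the block file, which matters as much for runtime as the streaming itself.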
