0

I have two large-ish CSV files. One file is just a list of records. The other file is a list of records, but the first column is the line number of the record that it modifies in the other file. It doesn't replace the whole row; It just replaces the value in the row that has the matching header.

For example:

File 1:

"First","Last","Lang"
"John","Doe","Ruby"
"Jane","Doe","Perl"
"Dane","Joe","Lisp"

File 2:

"Seq","Lang"
2,"Ruby"

The goal is to end up with one file that looks like this:

"First","Last","Lang"
"John","Doe","Ruby"
"Jane","Doe","Ruby"
"Dane","Joe","Lisp"

The data is, however, much more complicated than that and could even contain line breaks in the CSV. Thus, I can't rely on the line number and instead I have to rely on the record count. (Unless, of course, I preprocess both files to replace newlines and carriage returns.. which I suppose is possible but less interesting.)

The question I have is how do I loop through both files and do the proper replacement without loading either of the entire files into memory. I believe 100mb+ files loaded into memory is a bad idea, right?

Also, the records in the resulting file should be in the same order when it's done.

Jeremy Baker
  • 3,986
  • 3
  • 24
  • 27
  • Guessing these hand-encoded CSV files? I assume because you are wrapping each string in `"` when that's completely unnecessary for the values you've provided. – maček Dec 28 '12 at 09:55
  • 1
    is File 2's 'Seq' column in order? – pguardiario Dec 28 '12 at 10:01
  • I just typed out examples by hand. In the case of the real files, one of them doesn't use quotes unless necessary and the other forces quotes. CSV should parse them either way, with or without quotes. – Jeremy Baker Dec 28 '12 at 10:01
  • pguardiario: Yes, it should always be in order. – Jeremy Baker Dec 28 '12 at 10:02

2 Answers2

1

This is basically how I'd handle it if the files were too large to load into memory

// pseudocode

f1 = fopen(file1)
f2 = fopen(file2)
f3 = fopen(newfile)

// loop through exceptions
foreach row2, index2 of f2

  // loop through file1 until a matched row is found
  while (row1, index1 of f1) && (row1 not null) && (row2[seq] <= index1)

    // patch
    if row2[seq] == index1
      row1[lang] = row2[lang]
    endif

    // write out to new file
    f3.write row1

  endwhile
endforeach

† Since your file2 has 1-based indices (instead of being 0-based), you will want to start your index1 and index2 counters at 1.


†† If lang isn't the column you will always be replacing:

// at the beginning of the foreach loop
if col is null
  cols = array_keys row2
  col = cols[2] // 1-based index
end

// the new patch block
if row2[seq] == index1
  row1[col] = row2[col]
endif

maček
  • 76,434
  • 37
  • 167
  • 198
1

You will need 2 enumerators, but since they're not nested, one will need to use Enumerator#next, which means you need to be careful about it raising a EOF exception:

e = CSV.open('file2.csv', :headers => true).each
seq = e.next

output = CSV.open('output.csv', 'w')

csv = CSV.open('file1.csv')
csv.each do |row|
  if seq['Seq'].to_i == csv.lineno - 1
    row[2] = seq['Lang']
    seq = e.next rescue ({'Seq' => -1})
  end
  output << row
end
pguardiario
  • 53,827
  • 19
  • 119
  • 159