
I am reading a file that is 10 MB in size and which contains some IDs. I read them into a list in Ruby. I am concerned that this might cause memory issues in the future, when the number of IDs in the file increases. Is there an effective way of reading a large file in batches?

Thank you

Boolean

3 Answers


With lazy enumerators and each_slice, you can get the best of both worlds: you don't need to worry about lines being cut in the middle, and you can iterate over multiple lines per batch. Because the enumerator is lazy, only one batch of lines is held in memory at a time. batch_size can be chosen freely.

header_lines = 1
batch_size   = 2000

File.open("big_file") do |file|
  file.lazy.drop(header_lines).each_slice(batch_size) do |lines|
    # do something with batch of lines
  end
end
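
Applied to the original question, each batch can be processed and released before the next one is read. A minimal sketch, assuming one ID per line (the filename is a placeholder):

header_lines = 1
batch_size   = 2000
ids = []

File.open("big_file") do |file|
  file.lazy.drop(header_lines).each_slice(batch_size) do |lines|
    # only one batch of raw lines is in memory at a time
    ids.concat(lines.map(&:strip))
  end
end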

It could be used to import a huge CSV file into a database:

require 'csv'
batch_size   = 2000

File.open("big_data.csv") do |file|
  headers = file.first
  file.lazy.each_slice(batch_size) do |lines|
    csv_rows = CSV.parse(lines.join, headers: headers)
    # do something with 2000 csv rows, e.g. bulk insert them into a database
  end
end
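
For the bulk-insert step, one option (assuming Rails 6+ and a hypothetical Record model) is insert_all, which writes a whole batch in a single statement:

require 'csv'
batch_size = 2000

File.open("big_data.csv") do |file|
  headers = file.first
  file.lazy.each_slice(batch_size) do |lines|
    csv_rows = CSV.parse(lines.join, headers: headers)
    # insert_all takes an array of attribute hashes (Rails 6+)
    Record.insert_all(csv_rows.map(&:to_h))
  end
end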
Eric Duminil

There's no universal way.

1) You can read the file in chunks:

File.open('filename', 'r') do |f|
  while (chunk = f.read(2048))
    # process chunk
  end
end

disadvantage: you can miss a substring if it spans two chunks, e.g. you are looking for "SOME_TEXT", but "SOME_" is the last 5 bytes of the first 2048-byte chunk and "TEXT" is at the start of the second chunk
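
One way around this is to carry the tail of each chunk over to the next read. A minimal sketch, assuming the pattern you search for is shorter than the chunk size (filename and pattern are placeholders):

pattern    = "SOME_TEXT"
chunk_size = 2048
carry      = ""

File.open('filename', 'r') do |f|
  while (chunk = f.read(chunk_size))
    buffer = carry + chunk
    puts "found" if buffer.include?(pattern)
    # keep the last (pattern length - 1) bytes so a match spanning
    # the chunk boundary is still seen in the next iteration
    carry = buffer[-(pattern.size - 1)..] || buffer
  end
end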

2) You can read the file line by line:

File.open('filename', 'r') do |f|
  while (line = f.gets)
    # process line
  end
end

disadvantage: this is typically 2x to 5x slower than the first method
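
For the original use case, collecting one ID per line, File.foreach reads the file a line at a time without ever loading it whole (the filename is a placeholder):

ids = []
File.foreach('filename') do |line|
  ids << line.strip
end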

zed_0xff

If you're worried this much about speed and memory efficiency, have you considered shelling out to grep, awk, sed etc.? If I knew a bit more about the structure of the input file and what you're trying to extract, I could potentially construct a command for you.
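
If you do go that route, IO.popen lets Ruby stream the tool's output instead of buffering it all at once. A sketch assuming numeric IDs in a file called big_file:

# grep emits one ID per line; Ruby consumes them as a stream
IO.popen(["grep", "-oE", "[0-9]+", "big_file"]) do |io|
  io.each_line do |line|
    # process one ID at a time without loading the whole output
    id = line.strip
  end
end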

Clemens Kofler
  • Sorry, the question is specifically about Ruby. It wouldn't make sense to use shell commands for a canonical Ruby question. – Eric Duminil May 14 '21 at 17:22
  • The question says the author wants to "read IDs into a list in Ruby". Nowhere does it say that the reading needs to happen in Ruby – only the storing in the list. Also, shelling out isn't a special thing – it doesn't require extra libraries or whatever, it's just a feature of the language. So I don't quite follow your line of argument. – Clemens Kofler May 17 '21 at 07:27
  • I guess you're trying to help, but your answer isn't really useful, and should have been a comment. The question is specifically tagged Ruby & Ruby-on-Rails, and I posted a bounty for a Ruby answer. Ruby has excellent text processing capabilities, Rails runs on systems which don't have grep/awk/sed, and Ruby code can be much more readable than awk. Care needs to be taken not to use too much memory, and that's what the question is about. – Eric Duminil May 17 '21 at 07:42
  • Another option that comes to mind is to use the `split` command offered by Linux: You could use `split -l 1000` to split the input file into separate equally sized files and then process them one-by-one with Ruby, thus keeping most of the logic in Ruby while having the file size (and consequently also memory usage) relatively constant even as the overall number of lines grows. – Clemens Kofler May 17 '21 at 16:44