
I am reading a file that is 10 MB in size and which contains some IDs. I read them into a list in Ruby. I am concerned that this might cause memory issues in the future, when the number of IDs in the file increases. Is there an effective way of reading a large file in batches?

Thank you

Boolean

3 Answers


With lazy enumerators and each_slice, you can get the best of both worlds: you don't need to worry about lines being cut in the middle, and you can iterate over multiple lines per batch. Because the enumerator is lazy, only one batch of lines is held in memory at a time. batch_size can be chosen freely.

header_lines = 1
batch_size   = 2000

File.open("big_file") do |file|
  file.lazy.drop(header_lines).each_slice(batch_size) do |lines|
    # do something with batch of lines
  end
end
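
Applied to the original question, each batch can be processed and released before the next one is read. A minimal sketch, assuming one ID per line (the filename is a placeholder):

header_lines = 1
batch_size   = 2000
ids = []

File.open("big_file") do |file|
  file.lazy.drop(header_lines).each_slice(batch_size) do |lines|
    # only one batch of raw lines is in memory at a time
    ids.concat(lines.map(&:strip))
  end
end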

It could be used to import a huge CSV file into a database:

require 'csv'
batch_size   = 2000

File.open("big_data.csv") do |file|
  headers = file.first
  file.lazy.each_slice(batch_size) do |lines|
    csv_rows = CSV.parse(lines.join, headers: headers)
    # do something with 2000 csv rows, e.g. bulk insert them into a database
  end
end
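
For the bulk-insert step, one option (assuming Rails 6+ and a hypothetical Record model) is insert_all, which writes a whole batch in a single statement:

require 'csv'
batch_size = 2000

File.open("big_data.csv") do |file|
  headers = file.first
  file.lazy.each_slice(batch_size) do |lines|
    csv_rows = CSV.parse(lines.join, headers: headers)
    # insert_all takes an array of attribute hashes (Rails 6+)
    Record.insert_all(csv_rows.map(&:to_h))
  end
end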
Eric Duminil

There's no universal way.

1) You can read the file in chunks:

File.open('filename', 'r') do |f|
  while (chunk = f.read(2048))
    # process chunk
  end
end

disadvantage: you can miss a substring if it spans two chunks, e.g. you are looking for "SOME_TEXT", but "SOME_" is the last 5 bytes of the first 2048-byte chunk and "TEXT" is at the start of the second chunk
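
One way around this is to carry the tail of each chunk over to the next read. A minimal sketch, assuming the pattern you search for is shorter than the chunk size (filename and pattern are placeholders):

pattern    = "SOME_TEXT"
chunk_size = 2048
carry      = ""

File.open('filename', 'r') do |f|
  while (chunk = f.read(chunk_size))
    buffer = carry + chunk
    puts "found" if buffer.include?(pattern)
    # keep the last (pattern length - 1) bytes so a match spanning
    # the chunk boundary is still seen in the next iteration
    carry = buffer[-(pattern.size - 1)..] || buffer
  end
end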

2) You can read the file line by line:

File.open('filename', 'r') do |f|
  while (line = f.gets)
    # process line
  end
end

disadvantage: this is typically 2x to 5x slower than the first method
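
For the original use case, collecting one ID per line, File.foreach reads the file a line at a time without ever loading it whole (the filename is a placeholder):

ids = []
File.foreach('filename') do |line|
  ids << line.strip
end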

zed_0xff

If you're worried this much about speed and memory efficiency, have you considered shelling out to grep, awk, sed etc.? If I knew a bit more about the structure of the input file and what you're trying to extract, I could potentially construct a command for you.
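
If you do go that route, IO.popen lets Ruby stream the tool's output instead of buffering it all at once. A sketch assuming numeric IDs in a file called big_file:

# grep emits one ID per line; Ruby consumes them as a stream
IO.popen(["grep", "-oE", "[0-9]+", "big_file"]) do |io|
  io.each_line do |line|
    # process one ID at a time without loading the whole output
    id = line.strip
  end
end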

Clemens Kofler
  • Sorry, the question is specifically about Ruby. It wouldn't make sense to use shell commands for a canonical Ruby question. – Eric Duminil May 14 '21 at 17:22
  • The question says the author wants to "read IDs into a list in Ruby". Nowhere does it say that the reading needs to happen in Ruby – only the storing in the list. Also, shelling out isn't a special thing – it doesn't require extra libraries or whatever, it's just a feature of the language. So I don't quite follow your line of argument. – Clemens Kofler May 17 '21 at 07:27
  • I guess you're trying to help, but your answer isn't really useful, and should have been a comment. The question is specifically tagged Ruby & Ruby-on-Rails, and I posted a bounty for a Ruby answer. Ruby has excellent text processing capabilities, Rails runs on systems which don't have grep/awk/sed, and Ruby code can be much more readable than awk. Care needs to be taken not to use too much memory, and that's what the question is about. – Eric Duminil May 17 '21 at 07:42
  • Another option that comes to mind is to use the `split` command offered by Linux: You could use `split -l 1000` to split the input file into separate equally sized files and then process them one-by-one with Ruby, thus keeping most of the logic in Ruby while having the file size (and consequently also memory usage) relatively constant even as the overall number of lines grows. – Clemens Kofler May 17 '21 at 16:44