0

I need to be able to read the first (header) row in big xlsx file (350k x 12 cells, ~30MB) very fast in Ruby on Rails app. I am using Roo gem at the moment, which is fine for smaller files. But for files this big it takes 3-4 minutes. Is there a way to do this in seconds?

xlsx = Roo::Spreadsheet.open(file_path)
sheet = xlsx.sheet(0)
header = sheet.row(1)

Edit:

  • I tried other gems:
    • rubyXL took several minutes
    • creek was the fastest with 30s. But still unusable in controller

Edit2:

  • I ended up using creek in a job and polling for the result in controller. Thx Tom Lord for suggesting creek
GTO
  • 127
  • 11
  • Maybe it will not change the parsing time dramatically, but there was a comparison of different parsers: https://stackoverflow.com/questions/3321011/parsing-xls-and-xlsx-ms-excel-files-with-ruby – Pavel Oganesyan Dec 07 '21 at 14:00

2 Answers2

1

The ruby gem roo does not support file streaming; it reads the whole file into memory. Which, as you say, works fine for smaller files but not so well for reading small sections of huge files.

You need to use a different library/approach. For example, you can use the gem: creek, which describes itself as:

a Ruby gem that provides a fast, simple and efficient method of parsing large Excel (xlsx and xlsm) files.

And, taking the example from the project's README, it's pretty straightforward to translate the code you wrote for roo into code that uses creek:

require 'creek'
creek = Creek::Book.new(file_path)
sheet = creek.sheets[0]
header = sheet.rows[0]

Note: A quick google of your StackOverflow question title led me to this blog post as the top search result. It's always worth searching on Google first.

Tom Lord
  • 27,404
  • 4
  • 50
  • 77
  • I have been using creek gem for many years and it's wonderful and it's very fast as the description says, this gem can be installed only with devkit. – Rajagopalan Dec 08 '21 at 00:52
  • Thank you for the response! I tried rubyXL, which took several minutes and I tried creek, which was the fastest with 30s. But still unusable in controller – GTO Dec 08 '21 at 09:07
  • @GTO I haven't actually tried running this with a large file, but *in theory* the idea of `creek` is that you can *stream* the file instead of reading the whole thing into memory. Does your code look something like the above, calling `Creek::Book.new`, or are you still trying to read the whole file into memory? – Tom Lord Dec 10 '21 at 13:59
  • Yea, I am calling `Creek::Book.new`, this line still takes about 30s – GTO Dec 13 '21 at 12:12
-1

Using #getscould work, maybe something like:

first_line_data = File.open(file_path, "rb", &:gets)
first_line_file = File.open("tmp_file.xlsx", "wb") { |f| f << first_line_data }
xlsx = Roo::Spreadsheet.open("tmp_file.xlsx")
# etc...
obiruby
  • 379
  • 1
  • 6
  • 2
    This would work on a plain-text `csv` file, but not on a binay `xlsx` file. You cannot call `gets` on the file to obtain "the first row of the first sheet". – Tom Lord Dec 07 '21 at 16:31