1

I'm currently trying to open and parse this xls file using ruby 2.1.1. The straightforward way would be to use the simple-spreadsheet gem, which doesn't seem to work properly for this particular spreadsheet (and a couple others):

require 'simple-spreadsheet'
s=SimpleSpreadsheet::Workbook.read('151.xls')
puts s.last_row #prints 5

However, when I copy and paste this exact code into irb, I get the right answer

2.1.1 :001 > require 'simple-spreadsheet'
 => true 
2.1.1 :002 > s=SimpleSpreadsheet::Workbook.read('151.xls')
 => #prints the entire contents of the spreadsheet
2.1.1 :003 > s.last_row
 => 154

The same behaviour occurs when using Roo (which is to be expected, as simple-spreadsheet uses Roo to open .xls files):

require 'roo'
s=Roo::Excel.new('151.xls')
puts s.last_row #prints 5, should print 154

While on irb

2.1.1 :001 > require 'simple-spreadsheet'
 => true 
2.1.1 :002 > s=Roo::Excel.new('151.xls')
 => #prints the entire contents of the spreadsheet
2.1.1 :003 > s.last_row
 => 154

Digging further, I tried using spreadsheet, as it is required on Roo's excel.rb file:

require 'spreadsheet'    
Spreadsheet.open('151.xls') do |book|
  rows=0
  book.worksheet(0).each do |row|
    rows+=1
  end
  puts rows#prints 5
end

However, this is where it gets weird(er); when I copy and paste this last code into irb, I get

2.1.1 :001 > require 'spreadsheet'    
 => true 
2.1.1 :002 > Spreadsheet.open('152.xls') do |book|
2.1.1 :003 >       rows=0
2.1.1 :004?>     book.worksheet(0).each do |row|
2.1.1 :005 >           rows+=1
2.1.1 :006?>       end
2.1.1 :007?>     puts rows
2.1.1 :008?>   end
5
 => nil 

I should probably add that the gems aren't just printing 5; every cell below this row returns nil; they actually stop parsing the file after this particular row.

So here are my questions: fist, why is irb behaving differently than ruby? Secondly, why isn't any of those gems loading the entire spreadsheet? Third, what can I do to fix this?

Thanks for your help

ailnlv
  • 1,779
  • 1
  • 15
  • 29
  • Can you confirm which version of rubyzip you're using? I was trying with 0.9.9 and was getting results consistent with what you mentioned, but just wanted to be sure. I noticed the newest version of rubyzip breaks simple-spreadsheet. – mralexlau Apr 01 '14 at 16:13
  • I have rubyzip 1.1.2 and 0.9.9. When reading xlsx files I have to add the line "gem 'rubyzip', '< 1.0.0'" before requiring simple-spreadsheet or the script breaks when trying to load the spreadsheet, but that doesn't seem to be the case with xls files. Is there a way to check which version of rubyzip is actually being loaded? – ailnlv Apr 01 '14 at 17:57
  • 1
    Looks like you can do `Gem.loaded_specs['rubyzip'].version` within a program / within irb. Make sure you do the appropriate require first (eg - `require 'simple-spreadsheet'`) – mralexlau Apr 01 '14 at 19:21
  • simple-spreadsheet and roo load rubyzip 0.9.9, both when running ruby from the console and within irb. However, spreadsheet doesn't load rubyzip at all. – ailnlv Apr 01 '14 at 19:34
  • Thanks. I looked at this yesterday and it does in seem like a really strange issue, particularly the differences between the ruby program and irb behavior. Could you possibly make a copy of the spreadsheet, but with fewer lines (say 20 instead of 154) and see if it still has the same behavior? If it does still only read the first 5 lines, having a smaller file will decrease the debug time drastically. – mralexlau Apr 01 '14 at 20:02
  • Even though I was a bit late (since you already answered my question), I deleted a few rows off of the excel and left only 17 (so now the right output should be 17 in all cases) and uploaded it [here](http://www.megafileupload.com/en/file/514734/151-xls.html). However, it still takes almost a minute to load and parse when using inspect, while other spreadsheets take just a few seconds at most. – ailnlv Apr 02 '14 at 04:16

1 Answers1

1

This is one of the strangest issues I've ever seen. To answer your questions:

First, why is irb behaving differently than ruby?

IRB is a tool that is written in ruby, and is different than executing ruby directly. One of the main differences is that IRB has different options that can be set on startup. You can see the different flags in the source, and this page has some good descriptions of what those flags mean.

The culprit for the inconsistencies you're seeing is the --noinspect option, run irb --noinspect and executing your code should give you the same strange behavior as running the ruby program (where you only see 5 rows). This is because irb defaults to calling inspect on every line that's executed, so when you call s=SimpleSpreadsheet::Workbook.read('151.xls') it's actually like calling this in a ruby program:

s = SimpleSpreadsheet::Workbook.read('151.xls')
s.inspect

Why does calling s.inspect matter? Keep reading...

Secondly, why isn't any of those gems loading the entire spreadsheet?

This is the hardest question to answer, I still haven't pinpointed it 100%, but hopefully the 90% explanation is good enough. In short, all of the gems you mention rely on the spreadsheet gem. One of the quirks/bugs of that gem in this particular setup (I don't think it's designed to always work like this) seems to be its reliance on the inspect method. If you call inspect on the worksheet it is persisting more values than if you don't call it. That is to say, if you don't call inspect on it's only reading part of the file (or reading some of it and disregarding the rest, I'm not sure) but if you do call inspect it reads the file entirely.

Third, what can I do to fix this?

As I alluded to before, manually calling inspect should read the entire file:

s = SimpleSpreadsheet::Workbook.read('151.xls')
s.inspect
mralexlau
  • 1,823
  • 1
  • 16
  • 23
  • This is insane, how did you figure this out? Please let me know if you actually pinpoint the issue, but it probably has something to do with why the gem takes so long to read the spreadsheet when compared to others of similar size. I finally wrote the same code in python, which takes a few seconds to download and parse about 50 spreadsheets, while ruby is taking over 50 seconds to parse just one spreadsheet. Should I open a ticket with the maintainers of spreadsheet? – ailnlv Apr 02 '14 at 03:36
  • I had originally tried to debug the 2 versions side by side (ruby program vs running the code in irb). This didn't get me anywhere so eventually I lucked out when researching how irb works. – mralexlau Apr 02 '14 at 15:08
  • You're right in that the code seems to be slow, I suspect that's an issue with `simple-spreadsheet` [depending on an old version](https://github.com/zenkay/simple-spreadsheet/blob/master/Gemfile.lock#L7) of the `spreadsheet` gem (0.6.5.9). You could submit an issue to request to have it updated, but since `simple-spreadsheet` just references other gems, I would probably ditch it in favor of `roo` which uses a newer version of spreadsheet. – mralexlau Apr 02 '14 at 15:13
  • I also see you've [submitted an issue with spreadsheet](https://github.com/zdavatz/spreadsheet/issues/93), which was the other thing I was going to suggest :) – mralexlau Apr 02 '14 at 15:14