I have a Ruby script (1.9.2p290) where I am trying to call a number of URLs, and then append information from those URLs into a file. The issue is that I keep getting an end of file error - EOFError. An example of what I'm trying to do is:

require "open-uri"

proxy_uri = URI.parse("http://IP:PORT")
somefile = File.open("outputlist.txt", 'a')
output = []

(1..100).each do |num|
  # num is an Integer, so convert it before appending it to the URL string
  page = open('SOMEURL' + num.to_s, :proxy => proxy_uri).read
  pattern = "<img"
  tags = page.scan(pattern)
  output << tags.length
end
somefile.puts output
somefile.close

I don't know why I keep getting this end of file error, or how I can avoid getting the error. I think it might have something to do with the URL that I'm calling (based on some dialogue here: What is an EOFError in Ruby file I/O?), but I'm not sure why that would affect the I/O or cause an end of file error.

Any thoughts on what I might be doing wrong here or how I can get this to work?

Thanks in advance!

Cam Norgate

1 Answer

The way you are writing your file isn't idiomatic Ruby. This should work better:

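require "open-uri"

proxy_uri = URI.parse("http://IP:PORT")
output = []

# Collect the number of "<img" occurrences on each page.
(1..100).each do |num|
  page = open('SOMEURL' + num.to_s, :proxy => proxy_uri).read
  pattern = "<img"
  tags = page.scan(pattern)
  output << tags.length
end

# The block form opens the file only for the write and closes it when the block ends.
File.open("outputlist.txt", 'a') do |fo|
  fo.puts output
end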

I suspect that the file is being closed because it was opened and then not written to while 100 pages are processed. If that takes a while, I can see why it would get closed, to keep apps from using up all the file handles. Writing it the Ruby way, with the block form of File.open, closes the file automatically as soon as the block finishes, so the handle isn't held open longer than needed.
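
To illustrate the difference between the two ways of calling File.open, here is a minimal sketch. The temporary directory and the sample counts are assumptions made up for illustration; they are not from the question.

```ruby
require "tmpdir"

# Hypothetical scratch location so the sketch does not touch a real outputlist.txt.
path = File.join(Dir.mktmpdir, "outputlist.txt")
output = [3, 0, 7] # made-up image counts

# Non-block form: File.open returns an open handle that stays open until closed.
somefile = File.open(path, "a")
open_before_close = !somefile.closed?
somefile.close

# Block form: the file is closed automatically when the block ends.
handle = nil
File.open(path, "a") do |fo|
  handle = fo
  fo.puts output # puts writes each array element on its own line
end
closed_after_block = handle.closed?

contents = File.read(path)
```

With the block form there is no separate close call to forget, and the handle exists only for the duration of the write.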

As a secondary point, rather than using a simple pattern match to locate image tags, use a real HTML parser. There will be little difference in processing speed, but the count is likely to be more accurate.

Replace:

page = open('SOMEURL' + num.to_s, :proxy => proxy_uri).read
pattern = "<img"
tags = page.scan(pattern)
output << tags.length

with:

require 'nokogiri'

# Parse the page and count the actual <img> elements.
doc = Nokogiri::HTML(open('SOMEURL' + num.to_s, :proxy => proxy_uri))
output << doc.search('img').size
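
To make the accuracy point concrete, here is a small sketch that uses only String#scan on two made-up HTML snippets (both strings are assumptions for illustration). It shows two ways a plain "<img" match can miscount. A parser such as Nokogiri works on parsed elements rather than raw text, so it should not be thrown off by either case.

```ruby
# One real image plus one that is commented out.
commented = '<p><img src="a.png"><!-- <img src="old.png"> --></p>'

# One real image written with an uppercase tag name.
uppercase = '<p><IMG SRC="a.png"></p>'

# The plain string match counts the commented-out tag as well.
commented_count = commented.scan("<img").length

# The plain string match is case-sensitive, so it misses the uppercase tag.
uppercase_count = uppercase.scan("<img").length

puts "commented snippet: #{commented_count} matches for 1 real image"
puts "uppercase snippet: #{uppercase_count} matches for 1 real image"
```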
the Tin Man
  • Wow - thank you. Working perfectly now. It was definitely the fact that I was opening the file prematurely... I didn't know that declaring it also opened the file. Appreciate your help. Also, point is a good one re: parser - will layer that in too! – Cam Norgate Dec 17 '12 at 13:08
  • `somefile = File.open("outputlist.txt", 'a')` doesn't declare the variable, it opens the file. Ruby doesn't need variables declared in advance. – the Tin Man Dec 17 '12 at 15:03