
When doing web scraping with Nokogiri I occasionally get the following error message

 undefined method `at_css' for nil:NilClass (NoMethodError)

I know that the selected element is present at least some of the time, but the site is sometimes slow to respond, and I suspect that's why I'm getting the error.

Is there some way to wait until a certain selector is present before proceeding with the script?

My current HTTP request block looks like this:

require "net/http"
require "uri"
require "nokogiri"

url = URL
body = BODY
uri = URI.parse(url)
http = Net::HTTP.new(uri.host, uri.port)
http.read_timeout = 200 # default 60 seconds
http.open_timeout = 200 # default nil
http.use_ssl = true
request = Net::HTTP::Post.new(uri.request_uri)
request.body = body
request["Content-Type"] = "application/x-www-form-urlencoded"
begin
  response = http.request(request)
  doc = Nokogiri::HTML(response.body)
rescue
  sleep 100
  retry
end
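(As an aside on the `rescue`/`retry` block above: a bare `retry` with no attempt limit will loop forever if the site stays down. A bounded variant might look like this; the 5-attempt limit and 10-second pause are arbitrary values for illustration:)

```ruby
# Retry a block a bounded number of times instead of looping forever.
# max_attempts and pause are arbitrary illustration values.
def with_retries(max_attempts: 5, pause: 10)
  attempts = 0
  begin
    yield
  rescue StandardError
    attempts += 1
    raise if attempts >= max_attempts
    sleep pause
    retry
  end
end
```

Used as `response = with_retries { http.request(request) }`, it gives up and re-raises after the last attempt instead of hanging the script indefinitely.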
vorpyg
  • If you are waiting for Javascript to evaluate or an AJAX request to complete - Nokogiri won't run Javascript so it'll never appear. You can either make the AJAX request manually, or use some Javascript-enabled browser emulator, maybe one of the emulators abstracted by [Capybara](https://github.com/jnicklas/capybara). – Leonid Shevtsov Feb 13 '15 at 10:18
  • In my opinion, [Watir](http://watir.com/) is much better and easier to use. – Oleksandr Holubenko Feb 13 '15 at 10:22
  • Thanks, but I'm not waiting for any Javascript/AJAX to complete/evaluate. – vorpyg Feb 13 '15 at 10:23
  • I actually did some testing with Watir, but as it uses Selenium it took quite some time to finish, as I'm making quite a number of requests. Therefore I switched to Nokogiri and Net::HTTP – vorpyg Feb 13 '15 at 10:25
  • `Net::HTTP` is **not** [streaming the response](http://ruby-doc.org/stdlib-2.2.0/libdoc/net/http/rdoc/Net/HTTP.html#class-Net::HTTP-label-Streaming+Response+Bodies): *"By default Net::HTTP reads an entire response into memory."*. `response.body` always returns the full body. – Stefan Feb 13 '15 at 10:28
  • Thanks, Stefan, that looks interesting! – vorpyg Feb 13 '15 at 10:35

1 Answer


While you can stream the response with Net::HTTP, as @Stefan notes in his comment, and feed it to a handler that includes Nokogiri, you can't parse a partial HTML document using a DOM model, which is Nokogiri's default, because a DOM parser also expects the complete document.

You could use Nokogiri's SAX parser, but that's an entirely different programming style.

If you're retrieving an entire page, then use OpenURI instead of the lower-level Net::HTTP. It automatically handles a number of things that Net::HTTP will not do by default, such as redirection, which makes it a lot easier to retrieve pages and will greatly simplify your code.

I suspect the problem is either that the site is timing out, or the tag you're trying to find is dynamically loaded after the real page loads.

If it's timing out you'll need to increase your wait time.

If it's dynamically loading that markup, you can request the main page, locate the appropriate URL for the dynamic content and load it separately. Once you have it, you can either insert it into the first page if you need everything, or just parse it separately.

the Tin Man
  • Thanks, I forgot that the sleep interval is specified in ms, not seconds like the Net::HTTP timeout settings, and this seems to make the whole difference. – vorpyg Feb 16 '15 at 09:10