
I have a rake task that is responsible for doing batch processing on millions of URLs. Because this process takes so long I sometimes find that URLs I'm trying to process are no longer valid -- 404s, site's down, whatever.

When I initially wrote this, there was basically just one site that would continually go down while processing, so my solution was to use open-uri, rescue any exceptions produced, wait a bit, and then retry.

This worked fine when the dataset was smaller, but now so much time goes by that I'm finding URLs that are no longer there and produce a 404.

Take the case of a 404: when this happens, my script just sits there and loops infinitely -- obviously bad.

How should I handle cases where a page doesn't load successfully, and more importantly how does this fit into the "stack" I've built?

I'm pretty new to this, and Rails, so any opinions on where I might have gone wrong in this design are welcome!

Here is some anonymized code that shows what I have:

The rake task that makes a call to MyHelperModule:

# lib/tasks/my_app_tasks.rake
namespace :my_app do
  desc "Batch processes some stuff @ a later time."
  task :process_the_batch => :environment do
    # The dataset being processed is millions of rows,
    # so this is a big job and should be done in batches!
    MyModel.where(some_thing: nil).find_in_batches do |my_models|
      MyHelperModule.do_the_process my_models: my_models
    end
  end
end

MyHelperModule accepts my_models and does further stuff with ActiveRecord. It calls SomeClass:

# lib/my_helper_module.rb
module MyHelperModule
  def self.do_the_process(args = {})
    my_models = args[:my_models]

    # Parallel.each(my_models, :in_processes => 5) do |my_model|
    my_models.each do |my_model|
      # Reconnect to prevent errors with Postgres
      ActiveRecord::Base.connection.reconnect!
      # Do some active record stuff

      some_var = SomeClass.new(my_model.id)

      # Do something super interesting,
      # fun,
      # AND sexy with my_model
    end
  end
end

SomeClass will go out to the web via WebpageHelper and process a page:

# lib/some_class.rb
require_relative 'webpage_helper'
class SomeClass
  attr_accessor :some_data

  def initialize(arg)
    doc = WebpageHelper.get_doc("http://somesite.com/#{arg}")
    # do more stuff
  end
end

WebpageHelper is where the exception is caught and an infinite loop is started in the case of 404:

# lib/webpage_helper.rb
require 'nokogiri'
require 'open-uri'

class WebpageHelper
  def self.get_doc(url)
    attempts = 0
    begin
      page_content = open(url).read
      # do more stuff
    rescue Exception => ex
      puts "Failed at #{Time.now}"
      puts "Error: #{ex}"
      puts "URL: " + url
      puts "Retrying... Attempt #: #{attempts}"
      attempts = attempts + 1
      sleep(10)
      retry
    end
  end
end
Mario Zigliotto

8 Answers


TL;DR

Use out-of-band error handling and a different conceptual scraping model to speed up operations.

Exceptions Are Not for Common Conditions

There are a number of other answers that address how to handle exceptions for your use case. I'm taking a different approach by saying that handling exceptions is fundamentally the wrong approach here for a number of reasons.

  1. In his book Exceptional Ruby, Avdi Grimm provides benchmarks showing exception handling to be roughly 156% slower than alternative coding techniques such as early returns.

  2. In The Pragmatic Programmer: From Journeyman to Master, the authors state "[E]xceptions should be reserved for unexpected events." In your case, 404 errors are undesirable, but are not at all unexpected--in fact, handling 404 errors is a core consideration!

In short, you need a different approach. Preferably, the alternative approach should provide out-of-band error handling and prevent your process from blocking on retries.

One Alternative: A Faster, More Atomic Process

You have a lot of options here, but the one I'm going to recommend is to handle 404 status codes as a normal result. This allows you to "fail fast," but also allows you to retry pages or remove URLs from your queue at a later time.

Consider this example schema:

ActiveRecord::Schema.define(:version => 20120718124422) do
  create_table "webcrawls", :force => true do |t|
    t.text     "raw_html"
    t.integer  "retries"
    t.integer  "status_code"
    t.text     "parsed_data"
    t.datetime "created_at",  :null => false
    t.datetime "updated_at",  :null => false
  end
end

The idea here is that you would simply treat the entire scrape as an atomic process. For example:

  • Did you get the page?

    Great, store the raw page and the successful status code. You can even parse the raw HTML later, in order to complete your scrapes as fast as possible.

  • Did you get a 404?

    Fine, store the error page and the status code. Move on quickly!
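Putting those two outcomes into code, a minimal sketch might look like this. It assumes a Webcrawl model backed by the schema above plus a hypothetical url column, and it mirrors the open-uri calls from the question:

require 'open-uri'

def crawl(url)
  record = Webcrawl.where(:url => url).first_or_initialize
  begin
    page = open(url)
    record.raw_html    = page.read
    record.status_code = page.status.first.to_i    # e.g. 200
  rescue OpenURI::HTTPError => ex
    record.raw_html    = ex.io.read                # error page body, if any
    record.status_code = ex.io.status.first.to_i   # e.g. 404
  end
  record.retries = record.retries.to_i + 1         # how many times we've fetched it
  record.save
end

Either way, the crawl records a result and moves on; nothing blocks or retries inline.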

When your process is done crawling URLs, you can then use an ActiveRecord lookup to find all the URLs that recently returned a 404 status so that you can take appropriate action. Perhaps you want to retry the page, log a message, or simply remove the URL from your list of URLs to scrape--"appropriate action" is up to you.

By keeping track of your retry counts, you could even differentiate between transient errors and more permanent errors. This allows you to set thresholds for different actions, depending on the frequency of scraping failures for a given URL.
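The follow-up queries are then ordinary ActiveRecord lookups; for example (the thresholds here are arbitrary):

# URLs that recently returned a 404 -- retry, log, or drop them as you see fit
recent_404s = Webcrawl.where(:status_code => 404).where("updated_at > ?", 1.day.ago)

# URLs that keep failing -- likely candidates for removal from the scrape list
permanent_failures = Webcrawl.where("status_code >= 400 AND retries >= ?", 3)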

This approach also has the added benefit of leveraging the database to manage concurrent writes and share results between processes. This would allow you to parcel out work (perhaps with a message queue or chunked data files) among multiple systems or processes.

Final Thoughts: Scaling Up and Out

Spending less time on retries or error handling during the initial scrape should speed up your process significantly. However, some tasks are just too big for a single-machine or single-process approach. If your process speedup is still insufficient for your needs, you may want to consider a less linear approach using one or more of the following:

  • Forking background processes.
  • Using dRuby to split work among multiple processes or machines.
  • Maximizing core usage by spawning multiple external processes using GNU parallel.
  • Something else that isn't a monolithic, sequential process.

Optimizing the application logic should suffice for the common case; if it doesn't, consider scaling up to more processes or out to more servers. Scaling out will certainly be more work, but it will also expand the processing options available to you.
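As a very rough sketch of the forking option above, something like this splits the ID space across child processes; the slice size, the use of pluck, and the per-child reconnection are my assumptions, not part of the original code:

# Fork one child per slice of IDs; each child needs its own DB connection.
MyModel.where(:some_thing => nil).pluck(:id).each_slice(50_000) do |ids|
  fork do
    ActiveRecord::Base.establish_connection
    ids.each { |id| SomeClass.new(id) }
  end
end
Process.waitall  # wait for every child to finish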

Todd A. Jacobs
  • This is an amazing answer and I am SO sorry that I took so long to respond that I caused you to miss the bounty. I think you are absolutely correct. I have been thinking about this wrong, and the method you are suggesting will also result in less load on the sites being scraped. There is referential data involved where a scrape may have to re-pull pages, and if I can get them locally it's perfect. This will also allow me to differentiate between the different data sources and their page contents. Thank you! – Mario Zigliotto Aug 18 '12 at 20:50

Curb has an easier way of doing this and can be a better (and faster) option than open-uri.

Errors Curb reports (and that you can rescue from and do something about):

http://curb.rubyforge.org/classes/Curl/Err.html

Curb gem: https://github.com/taf2/curb

Sample code:

def browse(url)
  c = Curl::Easy.new(url)
  begin
    c.connect_timeout = 3
    c.perform
    return c.body_str
  rescue Curl::Err::NotFoundError
    handle_not_found_error(url)
  end
end

def handle_not_found_error(url)
  puts "This is a 404!"
end
Pedro Nascimento

You could just raise the 404's:

rescue Exception => ex
  raise ex if ex.message['404']
  # retry for non-404s
end
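For context, here is a sketch of how that check could slot into the question's WebpageHelper, with an arbitrary cap of 5 retries for everything else (the cap and structure are my assumptions, not part of the original answer):

# lib/webpage_helper.rb (sketch)
require 'open-uri'

class WebpageHelper
  def self.get_doc(url)
    attempts = 0
    begin
      open(url).read
    rescue Exception => ex
      raise ex if ex.message['404']  # don't retry pages that are gone
      attempts += 1
      raise ex if attempts >= 5      # give up on anything else after 5 tries
      sleep(10)
      retry
    end
  end
end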
pguardiario

I actually have a rake task that does something remarkably similar. Here is the gist of what I did to deal with 404's, and you could apply it pretty easily.

Basically, what you want to do is use the following code as a filter and create a logfile to store your errors. So before you grab the website and process it, you first do the following.

Create/instantiate a logfile in your file:

@logfile = File.open("404_log_#{Time.now.strftime("%m_%d_%Y")}.txt","w")
# #{Time.now.strftime("%m_%d_%Y")} just includes the date in the filename in case you want
# to run diffs on your log files.

Then change your WebpageHelper class to something like this:

require 'net/http'
require 'open-uri'

class WebpageHelper
  def self.get_doc(url)
    response = Net::HTTP.get_response(URI.parse(url))
    if response.code.to_i == 404
      notify_me(url)
    else
      page_content = open(url).read
      # do more stuff
    end
  end
end

What this does is ping the page for a response code. The if statement checks whether the response code is a 404; if it is, it runs the notify_me method, otherwise it runs your commands as usual. I just arbitrarily created that notify_me method as an example. On my system I have it write to a txt file that it emails me upon completion. You could use a similar method to look at other response codes.

Generic logging method:

def notify_me(url)
  puts "Failed at #{Time.now}"
  puts "URL: " + url
  @logfile.puts("There was a 404 error for the site #{url} at #{Time.now}.")
end
Sean
  • Thank you. I'm a little bit lost about what I should do once a 404 (or any code I might want) is encountered. For example, I don't know how to gracefully make this trickle up the stack and not cause problems. It almost feels like I need a real exception to latch on to and then make choices, but again I'm really unsure of where this would even go. – Mario Zigliotto Jul 09 '12 at 05:32
  • I hope the clarifications made more sense. Basically, you are taking the need to handle broken links via exceptions out of the code by first checking what kind of response the page generates with the Net::HTTP.get_response method. That tells you that if it is a 404, you can just ignore that page and go to the next one. You could add additional elsif (response.code.to_i == ERROR_CODE) branches to handle more types of errors. – Sean Jul 11 '12 at 18:26
  • I'm a big fan of this method, especially since it avoids exceptions and explicitly handles the result it's expecting (good or bad). – Colin R Jul 11 '12 at 19:48
  • Exceptions, in this case, are a good thing. Mind it, we're all talking about ERROR codes here. If it's an error, throw an exception. – Pedro Nascimento Jul 12 '12 at 15:16
  • I think that is Colin's, and my point. Handle the code returned from the page before needing to resort to exception handling. Although, to be fair, I think it is probably just a matter of preference. – Sean Jul 13 '12 at 14:10

It all just depends on what you want to do with 404's.

Let's assume that you just want to swallow them. Part of pguardiario's response is a good start: You can raise an error, and retry a few times...

# lib/webpage_helper.rb
require 'nokogiri'
require 'open-uri'

class WebpageHelper
  def self.get_doc(url)
    attempt_number = 0
    begin
      attempt_number = attempt_number + 1
      page_content = open(url).read
      # do more stuff
    rescue Exception => ex
      puts "Failed at #{Time.now}"
      puts "Error: #{ex}"
      puts "URL: " + url
      puts "Retrying... Attempt #: #{attempts.to_s}"
      sleep(10)
      retry if attempt_number < 10 # Try ten times.
    end
  end
end

If you followed this pattern, it would just fail silently. Nothing would happen, and it would move on after ten attempts. I would generally consider that a Bad Plan(tm). Instead of just failing out silently, I would go for something like this in the rescue clause:

    rescue Exception => ex
      if attempt_number < 10 # Try ten times.
        retry 
      else
        raise "Unable to contact #{url} after ten tries."
      end
    end

and then throw something like this in MyHelperModule#do_the_process (you'd have to update your database to have errors and error_message columns):

    my_models.each do |my_model|
      # ... cut ...

      begin
        some_var = SomeClass.new(my_model.id)
      rescue Exception => e
        my_model.update_attributes(errors: true, error_message: e.message)
        next
      end

      # ... cut ...
    end

That's probably the easiest and most graceful way to do it with what you currently have. That said, if you're handling that many requests in one massive rake task, that's not very elegant. You can't restart it if something goes wrong, it ties up a single process on your system for a long time, etc. If you end up with any memory leaks (or infinite loops!), you find yourself in a place where you can't just say 'move on'. You should probably be using some kind of queueing system like Resque or Sidekiq, or Delayed Job (though it sounds like you have more items to queue than Delayed Job would happily handle). I'd recommend digging into those if you're looking for a more elegant approach.
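For a sense of what that looks like, here is a minimal Sidekiq-style sketch; the worker class name and retry count are placeholders, not anything from the original code:

# app/workers/scrape_worker.rb
class ScrapeWorker
  include Sidekiq::Worker
  sidekiq_options :retry => 3   # let Sidekiq handle retries with backoff

  # Each job fetches and processes a single page.
  def perform(my_model_id)
    SomeClass.new(my_model_id)
  end
end

The rake task then enqueues jobs instead of scraping inline: ScrapeWorker.perform_async(my_model.id).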

Christopher WJ Rueber
  • Chris, thank you. I was also thinking that using `raise` seemed like a good idea but I wasn't sure where it would go and where to `rescue` it. Do you think it would be prudent to use a custom/specific type of Exception? – Mario Zigliotto Jul 12 '12 at 00:57
  • @SizzlePants: It's usually good practice to raise a custom child of `StandardError` rather than a generic, because frequently there's a gain to be had from raising one error while rescuing another. Also, it allows errors you didn't plan for to rise to the surface instead of going unnoticed. Furthermore, it's customary to create a higher-level umbrella class that can be used to catch all errors _specific to your application (or gem)_. Example: A gem "Foo" might define `class Foo::Error < StandardError` as an umbrella, and `class Foo::TimeoutError < Foo::Error` for the actual error to be raised. – sinisterchipmunk Jul 15 '12 at 21:19

Instead of using initialize, which always returns a new instance of an object, when creating a new SomeClass from a scraping, I'd use a class method to create the instance. I'm not using exceptions here beyond what Nokogiri throws, because it sounds like nothing else should bubble up further: you just want these failures to be logged but otherwise ignored. You mentioned logging the exceptions--are you just logging what goes to stdout? I'll answer as if you are...

# lib/my_helper_module.rb
module MyHelperModule
  def self.do_the_process(args = {})
    my_models = args[:my_models]

    # Parallel.each(my_models, :in_processes => 5) do |my_model|
    my_models.each do |my_model|
      # Reconnect to prevent errors with Postgres
      ActiveRecord::Base.connection.reconnect!

      some_object = SomeClass.create_from_scrape(my_model.id)

      if some_object
        # Do something super interesting if you were able to get a scraping
        # otherwise nothing happens (except it is noted in our logging elsewhere)
      end
    end
  end
end

Your SomeClass:

# lib/some_class.rb
require_relative 'webpage_helper'
class SomeClass
  attr_accessor :some_data

  def initialize(doc)
    @doc = doc
  end

  # could shorten this, but you get the idea...
  def self.create_from_scrape(arg)
    doc = WebpageHelper.get_doc("http://somesite.com/#{arg}")
    if doc
      return SomeClass.new(doc)
    else
      return nil
    end      
  end

end

Your WebpageHelper:

# lib/webpage_helper.rb
require 'nokogiri'
require 'open-uri'

class WebpageHelper
  def self.get_doc(url)
    attempts = 0 # define attempts first in non-block local scope before using it
    begin
      page_content = open(url).read
      # do more stuff
    rescue Exception => ex
      attempts += 1
      puts "Failed at #{Time.now}"
      puts "Error: #{ex}"
      puts "URL: " + url
      if attempts < 3 
        puts "Retrying... Attempt #: #{attempts.to_s}"
        sleep(10)
        retry
      else
        return nil
      end
    end

  end
end
ebeland

Regarding the problem you're experiencing, you can do the following:


class WebpageHelper
  def self.get_doc(url)
    retried = false
    attempts = 0
    begin
      page_content = open(url).read
      # do more stuff
    rescue OpenURI::HTTPError => ex
      unless ex.io.status.first.to_i == 404
        log_error ex.message
        sleep(10)
        unless retried
          retried = true
          retry
        end
      end
    # FIXME: needs some refactoring
    rescue Exception => ex
      puts "Failed at #{Time.now}"
      puts "Error: #{ex}"
      puts "URL: " + url
      puts "Retrying... Attempt #: #{attempts.to_s}"
      attempts = attempts + 1
      sleep(10)
      retry
    end
  end
end

But I'd rewrite the whole thing in order to do parallel processing with Typhoeus:

https://github.com/typhoeus/typhoeus

where I'd assign a callback block which would do the handling of the returned data, thus decoupling the fetching of the page and the processing.

Something along these lines:



def on_complete(response)
  # handle a successful response: parse and persist response.body here
end

def on_failure(response)
  # handle a failed response: log or record the URL for a later retry
end

def run
  hydra = Typhoeus::Hydra.new
  reqs = urls.collect do |url|
    Typhoeus::Request.new(url).tap do |req|
      req.on_complete = method(:on_complete).to_proc
      hydra.queue(req)
    end
  end
  hydra.run
  # do something with all requests after all requests were performed, if needed
end

Roman

I think everyone's comments on this question are spot on and correct. There is a lot of good info on this page. Here is my attempt at collecting this very hefty bounty. That being said, +1 to all answers.

If you are only concerned with 404s when using OpenURI, you can handle just those types of exceptions:

# lib/webpage_helper.rb
rescue OpenURI::HTTPError => ex
  # handle OpenURI HTTP Error!
rescue Exception => e
  # similar to the original
  case e.message
  when /404/ then puts '404!'
  when /500/ then puts '500!'
  # etc ...
  end
end
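If you'd rather branch on the status code itself instead of pattern-matching the exception message, OpenURI::HTTPError also exposes the HTTP status line (the same ex.io.status used in another answer here). Roughly:

# lib/webpage_helper.rb (sketch)
rescue OpenURI::HTTPError => ex
  case ex.io.status.first.to_i   # e.g. 404, 500
  when 404 then puts '404!'
  when 500 then puts '500!'
  end
end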

If you want a bit more, you can do different exception handling per type of error.

# lib/webpage_helper.rb
rescue OpenURI::HTTPError => ex
  # do OpenURI HTTP ERRORS
rescue SyntaxError => ex
  # do Syntax Errors
rescue Exception => ex
  # do what we were doing before
end

I also like what is said in the other posts about the number of attempts. Make sure it isn't an infinite loop.

I think the Rails thing to do after a number of attempts would be to log, queue, and/or email.

To log you can use

require 'log4r'

webpage_logger = Log4r::Logger.new("webpage_helper_logger")

# somewhere later, e.g. on a 404:
case e.message
when /404/
  webpage_logger.debug "debug level error #{attempts}"
  webpage_logger.info  "info level error #{attempts}"
  webpage_logger.fatal "fatal level error #{attempts}"
end

There are many ways to queue. I think some of the best are faye and resque. Here is a link to both: http://faye.jcoglan.com/ https://github.com/defunkt/resque/

Queues work just like a line. Believe it or not, the Brits call lines "queues" (the more you know). So, using a queueing server, you can line up many requests, and when the server you are trying to reach comes back up, you can hammer it with the requests waiting in your queue. Thus forcing their server to go down again, but hopefully over time they will upgrade their machines because they keep crashing.
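To make that concrete with Resque (class and queue names here are just placeholders), each URL becomes its own job, so a dead site only stalls its own jobs rather than the whole batch:

class ScrapeJob
  @queue = :scrapes

  # Fetch and process a single page per job.
  def self.perform(my_model_id)
    SomeClass.new(my_model_id)
  end
end

Enqueue from the rake task with Resque.enqueue(ScrapeJob, my_model.id) instead of scraping inline.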

And finally, to email: Rails also comes to the rescue (not Resque)... Here is the link to the Rails guide on ActionMailer: http://guides.rubyonrails.org/action_mailer_basics.html

You could have a mailer like this:

class SomeClassMailer < ActionMailer::Base
  default :from => "notifications@example.com"

  def self.mail(*args)
    ...
  end
end

# then later, in the rescue clause:
rescue Exception => e
  case e.message
  when /404/
    SomeClassMailer.mail(:to => "broken@example.com", :subject => "Failure ! #{attempts}") if attempts == 3
  end
end
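For reference, a more conventional ActionMailer shape for this (the method name is assumed, and it expects a matching view template) would be:

class SomeClassMailer < ActionMailer::Base
  default :from => "notifications@example.com"

  # Renders app/views/some_class_mailer/failure_notification.text.erb
  def failure_notification(url, attempts)
    @url = url
    @attempts = attempts
    mail(:to => "broken@example.com", :subject => "Failure ! #{attempts}")
  end
end

Then, once the retry limit is hit in the rescue clause: SomeClassMailer.failure_notification(url, attempts).deliver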
earlonrails