
I am scraping a small number of sites with the Ruby Anemone gem.

require 'anemone'

Anemone.crawl("http://www.somesite.com") do |anemone|
  anemone.on_every_page do |page|
    ...
  end
end

Depending on the site, some require 'www' to be present in the URL while others require that it be omitted. How can I configure the crawler, or code it, so that it knows when to use the correct URL?

Jackson Henley

2 Answers


You can't know in advance, so do something similar to what you'd do while sitting in front of a browser.

Try one: see if you get a connection, see if you get a 200 response, then see if the title has "error" in it. If none of those checks fail, consider it good.

If not, try the other.
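
For example, a minimal sketch of that probe (the helper name and the title check are my own, not part of Anemone):

require 'anemone'
require 'net/http'
require 'uri'

# Hypothetical helper: probe both host variants and return the first one
# that connects, answers with a 200, and has no "error" in the <title>.
def working_base_url(domain)
  ["http://#{domain}/", "http://www.#{domain}/"].find do |candidate|
    begin
      response = Net::HTTP.get_response(URI.parse(candidate))
      response.is_a?(Net::HTTPOK) && response.body !~ /<title>[^<]*error/i
    rescue StandardError # no connection, DNS failure, timeout, etc.
      false
    end
  end
end

if (url = working_base_url("somesite.com"))
  Anemone.crawl(url) do |anemone|
    anemone.on_every_page do |page|
      # ...
    end
  end
end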

The problem with using a canned spider/crawler is that you have to work around its code whenever the situation is different from what its authors expected when they wrote the software.

the Tin Man

Most sites automatically redirect www.somesite.com to somesite.com, or the other way around, so you should not have to worry about that.

I would think Anemone can handle redirects(?), but if it can't, then I suggest you pre-check the URLs for redirects before you hand them over to Anemone. This question shows how to do that:

How can I get the final URL after redirects using Ruby?

I.e.:

final_url = check_base_url_for_redirect('www.somesite.com')
Anemone.crawl(final_url) ...
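
check_base_url_for_redirect isn't part of Anemone or the standard library; a minimal sketch of it, following Location headers with Net::HTTP as described in the linked question, could look like this:

require 'net/http'
require 'uri'

# Follow HTTP redirects (up to a limit) and return the final URL.
# Assumes the Location header is an absolute URL, which is typical
# for bare-domain/www redirects.
def check_base_url_for_redirect(url, limit = 5)
  raise "Too many redirects" if limit.zero?

  url = "http://#{url}" unless url =~ %r{\Ahttps?://}
  response = Net::HTTP.get_response(URI.parse(url))

  if response.is_a?(Net::HTTPRedirection)
    check_base_url_for_redirect(response['location'], limit - 1)
  else
    url
  end
end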
Casper