I'm hoping to scrape several tens of thousands of pages of government data (spread across several thousand folders) that are online, and combine it all into a single file. To speed up the process, I figured I'd download the site to my hard drive first, then crawl it with something like Anemone + Nokogiri. When I tried the sample code with the government site's online URL, everything worked fine, but when I change the URL to my local file path, the code runs but produces no output. Here's the code:
url="file:///C:/2011/index.html"
Anemone.crawl(url) do |anemone|
titles = []
anemone.on_every_page { |page| titles.push page.doc.at
('title').inner_html rescue nil }
anemone.after_crawl { puts titles.compact }
end
So nothing is output with the local file path, but the same code works when I plug in the corresponding online URL. Is Anemone somehow unable to crawl local directory structures? If not, are there other suggested ways of doing this crawling/scraping, or should I simply run Anemone on the online version of the site? Thanks.
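For reference, here's the kind of fallback I was considering if Anemone can't handle file:// URLs: skip the crawler entirely and walk the downloaded directory tree with Dir.glob, parsing each file with Nokogiri directly. This is just a minimal sketch; the C:/2011 path and the **/*.html glob pattern are assumptions about how the mirrored site is laid out on disk.

require 'nokogiri'

titles = []
# Recursively visit every .html file under the mirrored site root
Dir.glob("C:/2011/**/*.html") do |path|
  doc = Nokogiri::HTML(File.read(path))
  title = doc.at('title')
  titles << title.inner_html if title
end
puts titles

This avoids the HTTP layer altogether, though it also means no link-following, so it only works if every page I care about is already saved under that folder.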