I'm hoping to scrape several tens of thousands of pages of government data (spread across several thousand folders) that are online, and combine it all into a single file. To speed up the process, I figured I'd download the site to my hard drive first, then crawl it with something like Anemone + Nokogiri. When I tried the sample code with the government site's online URL, everything worked fine, but when I change the URL to my local file path, the code runs but produces no output. Here's the code:
url="file:///C:/2011/index.html"
Anemone.crawl(url) do |anemone|
titles = []
anemone.on_every_page { |page| titles.push page.doc.at
('title').inner_html rescue nil }
anemone.after_crawl { puts titles.compact }
end
So nothing is output with the local file path, but the same code works when I plug in the corresponding online URL. Is Anemone somehow unable to crawl local directory structures? If not, are there other suggested ways of doing this crawling/scraping, or should I simply run Anemone on the online version of the site? Thanks.
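For reference, here's the kind of fallback I was considering if Anemone can't handle file:// URLs: skip the crawler entirely and walk the downloaded directory tree with Dir.glob, parsing each file with Nokogiri directly. This is just a minimal sketch; the C:/2011 path and the **/*.html glob pattern are assumptions about how the mirrored site is laid out on disk.

require 'nokogiri'

titles = []
# Recursively visit every .html file under the mirrored site root
Dir.glob("C:/2011/**/*.html") do |path|
  doc = Nokogiri::HTML(File.read(path))
  title = doc.at('title')
  titles << title.inner_html if title
end
puts titles

This avoids the HTTP layer altogether, though it also means no link-following, so it only works if every page I care about is already saved under that folder.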