I have a Mechanize-based Ruby script to scrape a website. I am hoping to speed it up by caching the downloaded HTML pages locally to make the whole "tweak output -> run -> tweak output" cycle quicker. I would prefer not to have to install an external cache on the machine just for this script. The ideal solution would plug into Mechanize and transparently cache fetched pages, images and so on.

Anyone know of a library that will do this? Or another way of achieving the same outcome (script runs much quicker second time round)?

– David Tinker
  • I'm not sure if this would work for what you want out of the box, since it's apparently designed for reverse-proxying rather than proxying, but perhaps it could be re-purposed to do what you need? http://rtomayko.github.com/rack-cache/ – Steve Jorgensen Apr 10 '11 at 20:41

4 Answers


A good way of doing this type of thing is to use the (AWESOME) VCR gem.

Here's an example of how you would do it:

require 'vcr'
require 'mechanize'

# Setup VCR's configs.  The cassette library directory is where 
# all of your "recordings" are saved as YAML files.  
VCR.configure do |c|
  c.cassette_library_dir = 'vcr_cassettes'
  c.hook_into :webmock
end

# Make a request...
# The first time you do this it will actually make the call out
# Subsequent calls will read the cassette file instead of hitting the network
VCR.use_cassette('google_homepage') do
  a = Mechanize.new
  a.get('http://google.com/')
end

As you can see... VCR records the communication as a YAML file on the first run:

mario$  find tester -mindepth 1 -maxdepth 3
tester/vcr_cassettes
tester/vcr_cassettes/google_homepage.yml

If you want to have VCR create new versions of the cassettes, just delete the corresponding file.
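You can also let VCR decide when to re-record via its standard record modes instead of deleting files by hand (this is plain VCR configuration, nothing Mechanize-specific):

# :new_episodes replays what is already in the cassette and records
# any request that is not in it yet; :all always re-records.
VCR.use_cassette('google_homepage', record: :new_episodes) do
  Mechanize.new.get('http://google.com/')
end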

– Mario Zigliotto

If you store some information about the page after the first request, you can rebuild the page later without having to re-request it from the server.

# 1) store the page information
# uri: a URI instance
# response: a hash of response headers
# body: a string
# code: the HTTP response code
page = agent.get(url)
uri, response, body, code = [page.uri, page.response, page.body, page.code]

# 2) rebuild the page, given the stored information
page = Mechanize::Page.new(uri, response, body, code, agent)

I've used this technique in spiders/scrapers so that the code can be tweaked without having to re-request all the pages, e.g.:

# agent: a Mechanize instance
# storage: must respond to [] and []=, and must accept and return arbitrary ruby objects.
#    for in-memory storage, you could use a Hash.
#    or, you could write something that is backed by a filesystem, mongodb, riak, redis, s3, etc...
# logger: a Logger instance
class Foobar < Struct.new(:agent, :storage, :logger)

  def get_cached(uri)
    cache_key = "_cache/#{uri}"

    if args = storage[cache_key]
      logger.debug("getting (cached) #{uri}")
      uri, response, body, code = args
      page = Mechanize::Page.new(uri, response, body, code, agent)
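      # add_to_history is private on Mechanize (hence the send); it
      # records the rebuilt page in the agent's history, as a real get would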
      agent.send(:add_to_history, page)
      page

    else
      logger.debug("getting (UNCACHED) #{uri}")
      page = agent.get(uri)
      storage[cache_key] = [page.uri, page.response, page.body, page.code]
      page

    end
  end

end

Which you could use like this:

require 'logger'
require 'pp'
require 'rubygems'
require 'mechanize'

storage = {}

foo = Foobar.new(Mechanize.new, storage, Logger.new(STDOUT))
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/encoding")
foo.get_cached("http://ifconfig.me/encoding")

pp storage

Which prints the following information:

D, [2013-10-19T14:13:32.019291 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.375649 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376822 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376910 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/encoding
D, [2013-10-19T14:13:52.830416 #18107] DEBUG -- : getting (cached) http://ifconfig.me/encoding
{"_cache/http://ifconfig.me/ua"=>
  [#<URI::HTTP:0x007fe4ac94d098 URL:http://ifconfig.me/ua>,
   {"date"=>"Sat, 19 Oct 2013 19:13:33 GMT",
    "server"=>"Apache",
    "vary"=>"Accept-Encoding",
    "content-encoding"=>"gzip",
    "content-length"=>"87",
    "connection"=>"close",
    "content-type"=>"text/plain"},
   "Mechanize/2.7.2 Ruby/2.0.0p247 (http://github.com/sparklemotion/mechanize/)\n",
   "200"],
 "_cache/http://ifconfig.me/encoding"=>
  [#<URI::HTTP:0x007fe4ac99d2a0 URL:http://ifconfig.me/encoding>,
   {"date"=>"Sat, 19 Oct 2013 19:13:48 GMT",
    "server"=>"Apache",
    "vary"=>"Accept-Encoding",
    "content-encoding"=>"gzip",
    "content-length"=>"42",
    "connection"=>"close",
    "content-type"=>"text/plain"},
   "gzip,deflate,identity\n",
   "200"]}
– John Douthat

I'm not sure that caching the pages is going to help that much. What will help more is keeping a record of previously visited URLs so you don't revisit them repeatedly. Page caching is moot because you should have already grabbed the important information the first time you saw the page, so all you need to do is check whether you've seen it already. If you have, grab the summary information you care about and manipulate it as necessary.

I used to write analytical spiders using Perl's Mechanize, which Ruby's Mechanize is based on. Storing the previously visited URLs in some sort of cache, like a hash, was useful, but because apps crash or hosts go down mid-session, all the previous results would be gone. A real disk-based database was essential at that point.

I like Postgres, but even SQLite is a good choice. Whatever you use, get the important information on the drive where it can survive a restart or crash.
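With the sqlite3 gem, a crash-proof visited-URL set is only a few lines (the table and file names here are just for illustration):

require 'sqlite3'

db = SQLite3::Database.new('visited.db')
db.execute('CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)')

def visited?(db, url)
  !db.get_first_value('SELECT 1 FROM visited WHERE url = ?', url).nil?
end

def mark_visited(db, url)
  # INSERT OR IGNORE makes repeat visits harmless
  db.execute('INSERT OR IGNORE INTO visited (url) VALUES (?)', url)
end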

Something else I'd recommend is using a YAML file for your app's configuration. Put every parameter that is likely to change during the app's run in there. Then write the app so it periodically checks the file's modification time and reloads it if there's been a change. That way you can adjust its run-time behavior on the fly. I had to write a spider to analyze a Fortune 50 corporation's multiple websites several years ago. The app ran for three weeks spidering many different sites tied to that corporation, and because I could tweak the regex that controlled which pages the app processed, I could fine-tune it without shutting the app down.
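A minimal version of that mtime-based reload, assuming a config.yml in the script's working directory, could look like this:

require 'yaml'

CONFIG_FILE = 'config.yml'
config = YAML.load_file(CONFIG_FILE)
last_mtime = File.mtime(CONFIG_FILE)

loop do
  # Cheap check on every pass; reload only when the file has changed.
  if (mtime = File.mtime(CONFIG_FILE)) != last_mtime
    config = YAML.load_file(CONFIG_FILE)
    last_mtime = mtime
  end
  # ... fetch and process the next page using the current config ...
end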

– the Tin Man
  • Thanks. I do keep a hash of visited pages during the run to avoid getting stuck in loops. I can also see from the Mechanize source that it also keeps a history and uses If-Modified-Since when it can. I was hoping that someone may have extended this to put the history on disk or in a DB or whatever. – David Tinker Apr 11 '11 at 04:32

How about writing pages out to files, each page in an individual file, and separating the tweak and run cycles?
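For instance, dump each page body to disk during the run, then point Nokogiri at the saved files while tweaking; a rough sketch (the pages directory and digest-based filenames are just one way to do it):

require 'digest/sha1'
require 'fileutils'
require 'mechanize'
require 'nokogiri'

FileUtils.mkdir_p('pages')

# Run phase: fetch and save the raw HTML, one file per page.
agent = Mechanize.new
page = agent.get('http://example.com/')
File.binwrite("pages/#{Digest::SHA1.hexdigest(page.uri.to_s)}.html", page.body)

# Tweak phase: iterate over the saved files, no network needed.
Dir.glob('pages/*.html') do |file|
  doc = Nokogiri::HTML(File.read(file))
  # ... extract and massage the data here ...
end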

– fengolly