I have a Mechanize-based Ruby script to scrape a website. I am hoping to speed it up by caching the downloaded HTML pages locally to make the whole "tweak output -> run -> tweak output" cycle quicker. I would prefer not to have to install an external cache on the machine just for this script. The ideal solution would plug into Mechanize and transparently cache fetched pages, images and so on.

Anyone know of a library that will do this? Or another way of achieving the same outcome (script runs much quicker second time round)?

– David Tinker
  • I'm not sure if this would work for what you want out of the box, since it's apparently designed for reverse-proxying rather than proxying, but perhaps it could be re-purposed to do what you need? http://rtomayko.github.com/rack-cache/ – Steve Jorgensen Apr 10 '11 at 20:41

4 Answers


A good way of doing this type of thing is to use the (AWESOME) VCR gem.

Here's an example of how you would do it:

require 'vcr'
require 'mechanize'

# Setup VCR's configs.  The cassette library directory is where 
# all of your "recordings" are saved as YAML files.  
VCR.configure do |c|
  c.cassette_library_dir = 'vcr_cassettes'
  c.hook_into :webmock
end

# Make a request...
# The first time you do this it will actually make the call out
# Subsequent calls will read the cassette file instead of hitting the network
VCR.use_cassette('google_homepage') do
  a = Mechanize.new
  a.get('http://google.com/')
end

As you can see... VCR records the communication as a YAML file on the first run:

mario$  find tester -mindepth 1 -maxdepth 3
tester/vcr_cassettes
tester/vcr_cassettes/google_homepage.yml

If you want to have VCR create new versions of the cassettes, just delete the corresponding file.
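You can also let VCR decide when to re-record via its standard record modes instead of deleting files by hand (this is plain VCR configuration, nothing Mechanize-specific):

# :new_episodes replays what is already in the cassette and records
# any request that is not in it yet; :all always re-records.
VCR.use_cassette('google_homepage', record: :new_episodes) do
  Mechanize.new.get('http://google.com/')
end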

– Mario Zigliotto

If you store some information about the page after the first request, you can rebuild the page later without having to re-request it from the server.

# 1) store the page information
# uri: a URI instance
# response: a hash of response headers
# body: a string
# code: the HTTP response code
page = agent.get(url)
uri, response, body, code = [page.uri, page.response, page.body, page.code]

# 2) rebuild the page, given the stored information
page = Mechanize::Page.new(uri, response, body, code, agent)

I've used this technique in spiders/scrapers so that the code can be tweaked without having to re-request all the pages, e.g.:

# agent: a Mechanize instance
# storage: must respond to [] and []=, and must accept and return arbitrary ruby objects.
#    for in-memory storage, you could use a Hash.
#    or, you could write something that is backed by a filesystem, mongodb, riak, redis, s3, etc...
# logger: a Logger instance
class Foobar < Struct.new(:agent, :storage, :logger)

  def get_cached(uri)
    cache_key = "_cache/#{uri}"

    if args = storage[cache_key]
      logger.debug("getting (cached) #{uri}")
      uri, response, body, code = args
      page = Mechanize::Page.new(uri, response, body, code, agent)
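      # add_to_history is private on Mechanize (hence the send); it
      # records the rebuilt page in the agent's history, as a real get would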
      agent.send(:add_to_history, page)
      page

    else
      logger.debug("getting (UNCACHED) #{uri}")
      page = agent.get(uri)
      storage[cache_key] = [page.uri, page.response, page.body, page.code]
      page

    end
  end

end

Which you could use like this:

require 'logger'
require 'pp'
require 'rubygems'
require 'mechanize'

storage = {}

foo = Foobar.new(Mechanize.new, storage, Logger.new(STDOUT))
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/encoding")
foo.get_cached("http://ifconfig.me/encoding")

pp storage

Which prints the following information:

D, [2013-10-19T14:13:32.019291 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.375649 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376822 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376910 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/encoding
D, [2013-10-19T14:13:52.830416 #18107] DEBUG -- : getting (cached) http://ifconfig.me/encoding
{"_cache/http://ifconfig.me/ua"=>
  [#<URI::HTTP:0x007fe4ac94d098 URL:http://ifconfig.me/ua>,
   {"date"=>"Sat, 19 Oct 2013 19:13:33 GMT",
    "server"=>"Apache",
    "vary"=>"Accept-Encoding",
    "content-encoding"=>"gzip",
    "content-length"=>"87",
    "connection"=>"close",
    "content-type"=>"text/plain"},
   "Mechanize/2.7.2 Ruby/2.0.0p247 (http://github.com/sparklemotion/mechanize/)\n",
   "200"],
 "_cache/http://ifconfig.me/encoding"=>
  [#<URI::HTTP:0x007fe4ac99d2a0 URL:http://ifconfig.me/encoding>,
   {"date"=>"Sat, 19 Oct 2013 19:13:48 GMT",
    "server"=>"Apache",
    "vary"=>"Accept-Encoding",
    "content-encoding"=>"gzip",
    "content-length"=>"42",
    "connection"=>"close",
    "content-type"=>"text/plain"},
   "gzip,deflate,identity\n",
   "200"]}
– John Douthat

I'm not sure that caching the pages is going to help that much. What will help more is keeping a record of previously visited URLs so you don't revisit them repeatedly. Page caching is moot because you should have already grabbed the important information the first time you saw the page, so all you need to do is check whether you've seen it already. If you have, grab the summary information you care about and manipulate it as necessary.

I used to write analytical spiders using Perl's Mechanize, which Ruby's Mechanize is based on. Storing the previously visited URLs in some sort of cache, like a hash, was useful, but because apps crash or hosts go down mid-session, all the previous results would be gone. A real disk-based database was essential at that point.

I like Postgres, but even SQLite is a good choice. Whatever you use, get the important information on the drive where it can survive a restart or crash.
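With the sqlite3 gem, a crash-proof visited-URL set is only a few lines (the table and file names here are just for illustration):

require 'sqlite3'

db = SQLite3::Database.new('visited.db')
db.execute('CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)')

def visited?(db, url)
  !db.get_first_value('SELECT 1 FROM visited WHERE url = ?', url).nil?
end

def mark_visited(db, url)
  # INSERT OR IGNORE makes repeat visits harmless
  db.execute('INSERT OR IGNORE INTO visited (url) VALUES (?)', url)
end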

Something else I'd recommend is using a YAML file for your app's configuration. Put every parameter that is likely to change during the app's run in there. Then write the app so it periodically checks the file's modification time and reloads it if there's been a change. That way you can adjust its run-time behavior on the fly. I had to write a spider to analyze a Fortune 50 corporation's multiple websites several years ago. The app ran for three weeks spidering many different sites tied to that corporation, and because I could tweak the regex that controlled which pages the app processed, I could fine-tune it without shutting the app down.
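A minimal version of that mtime-based reload, assuming a config.yml in the script's working directory, could look like this:

require 'yaml'

CONFIG_FILE = 'config.yml'
config = YAML.load_file(CONFIG_FILE)
last_mtime = File.mtime(CONFIG_FILE)

loop do
  # Cheap check on every pass; reload only when the file has changed.
  if (mtime = File.mtime(CONFIG_FILE)) != last_mtime
    config = YAML.load_file(CONFIG_FILE)
    last_mtime = mtime
  end
  # ... fetch and process the next page using the current config ...
end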

– the Tin Man
  • Thanks. I do keep a hash of visited pages during the run to avoid getting stuck in loops. I can also see from the Mechanize source that it also keeps a history and uses If-Modified-Since when it can. I was hoping that someone may have extended this to put the history on disk or in a DB or whatever. – David Tinker Apr 11 '11 at 04:32

How about writing pages out to files, each page in an individual file, and separating the tweak and run cycles?
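For instance, dump each page body to disk during the run, then point Nokogiri at the saved files while tweaking; a rough sketch (the pages directory and digest-based filenames are just one way to do it):

require 'digest/sha1'
require 'fileutils'
require 'mechanize'
require 'nokogiri'

FileUtils.mkdir_p('pages')

# Run phase: fetch and save the raw HTML, one file per page.
agent = Mechanize.new
page = agent.get('http://example.com/')
File.binwrite("pages/#{Digest::SHA1.hexdigest(page.uri.to_s)}.html", page.body)

# Tweak phase: iterate over the saved files, no network needed.
Dir.glob('pages/*.html') do |file|
  doc = Nokogiri::HTML(File.read(file))
  # ... extract and massage the data here ...
end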

– fengolly