
My app does a lot of page scraping, e.g. fetching historical weather data. Once I've fetched a specific page, I'd like to cache it in my PostgreSQL database so I don't have to hit the remote server again for that specific request.

Since the historical data never changes, I want to cache those pages "forever" -- which means storing them in a long-term persistent store, e.g. a database.

I've written a rudimentary caching mechanism that wraps around Mechanize. It works, but it seems likely that someone with better coding chops than me would have already implemented this.
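To give a rough idea of what I mean, the wrapper does something along these lines (a simplified sketch, not my actual code -- an in-memory Hash stands in for the PostgreSQL store, and the class name is made up):

require 'mechanize'

# Simplified sketch: the real version persists pages to PostgreSQL
# instead of an in-memory Hash.
class CachingFetcher
  def initialize
    @agent = Mechanize.new
    @cache = {}   # url => page body
  end

  def get(url)
    @cache[url] ||= @agent.get(url).body   # hit the network only on a miss
  end
end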

Are there any gems or libraries that already do this?

fearless_fool
  • possible duplicate of [Is there a Ruby http client library with a response cache?](http://stackoverflow.com/questions/6104922/is-there-a-ruby-http-client-library-with-a-response-cache) – Gonzalo Feb 10 '13 at 18:17
  • I don't think this is a duplicate. Just providing a big table of all possible Ruby HTTP clients doesn't help the OP answer the specific question of how to cache web pages in a database. – mvp Feb 10 '13 at 20:16
  • @gonzalo: that spreadsheet is useful, but MVP is right: how can I extend Mechanize or Typhoeus or others that cache responses to act as a db-backed caching scheme? – fearless_fool Feb 11 '13 at 02:18

4 Answers


So I've thought and I've thought, and looked at the source code for Mechanize and for VCR, and I've decided that I'm really just over-thinking the problem. The following works just fine for my needs. (I'm using DataMapper, but translating it into an ActiveRecord model would be straightforward):

require 'data_mapper'
require 'yaml'

class WebCache
  include DataMapper::Resource

  property :id, Serial
  property :serialized_key, Text
  property :serialized_value, Text
  property :created_at, DateTime
  property :updated_at, DateTime

  # Look up akey in the cache; on a miss, evaluate the block, store the
  # serialized result, and return it. Defined as a class method so it can
  # be called as WebCache.with_db_cache(...).
  def self.with_db_cache(akey)
    serialized_key = YAML.dump(akey)
    if (r = self.all(:serialized_key => serialized_key)).count != 0
      # cache hit: return the de-serialized value
      YAML.load(r.first.serialized_value)
    else
      # cache miss: evaluate the block, serialize and cache the result
      yield(akey).tap {|avalue|
        self.create(:serialized_key => serialized_key,
                    :serialized_value => YAML.dump(avalue))
      }
    end
  end
end

Example usage:

require 'net/http'

def fetch(uri)
  WebCache.with_db_cache(uri) {|uri|
    # arrive here only on cache miss
    Net::HTTP.get_response(URI(uri))
  }
end

Commentary

I previously believed that a proper web-caching scheme would observe and honor header fields such as Cache-Control and If-Modified-Since, and would automatically handle timeouts and other web pathologies. But examining actual web pages made it clear that truly static data is frequently served with short cache times. So it makes more sense to let the caller decide how long something should be cached and when a failing query should be retried.

At that point, the code became very simple.

Moral: don't over-think your problems.
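That said, if a caller ever does need to bound the cache age, a small variation on the method above handles it. This is only a sketch: max_age is a hypothetical extra argument (in seconds, nil meaning "forever"), and the method lives inside the WebCache model:

# Variation on with_db_cache: treat an entry as a hit only if it is
# newer than max_age seconds; otherwise re-fetch and refresh the row.
def self.with_db_cache(akey, max_age = nil)
  serialized_key = YAML.dump(akey)
  r = first(:serialized_key => serialized_key)
  fresh = r && (max_age.nil? || Time.now - r.created_at.to_time < max_age)
  if fresh
    YAML.load(r.serialized_value)
  else
    yield(akey).tap {|avalue|
      attrs = { :serialized_key => serialized_key,
                :serialized_value => YAML.dump(avalue) }
      r ? r.update(attrs) : create(attrs)   # refresh a stale row or add a new one
    }
  end
end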

fearless_fool

VCR is probably what you want.
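For reference, a minimal VCR setup looks something like this (a sketch -- the cassette name and URL are made up; by default cassettes are YAML files on disk, not database rows):

require 'vcr'
require 'net/http'

VCR.configure do |c|
  c.cassette_library_dir = 'cassettes'   # where recorded responses live
  c.hook_into :webmock                   # intercept HTTP via WebMock
end

# The first run records the real response into the cassette;
# subsequent runs replay it without hitting the remote server.
VCR.use_cassette('historical_weather') do
  Net::HTTP.get_response(URI('http://example.com/weather/2013-02-10'))
end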

pguardiario
  • VCR is certainly close to what I want, and I've spent time studying it. Although I could possibly write a custom serializer that persists to a db, it's not clear how the key lookup would work on playback. – fearless_fool Feb 10 '13 at 23:47
  • Why is using a db important? – pguardiario Feb 11 '13 at 00:20
  • I anticipate > 100k cached pages. And hosting on Heroku. Reasons enough? – fearless_fool Feb 11 '13 at 02:09
  • I've done that many with vcr before. You just split it up between different 'cassettes'. You're right though, saving to a db (or memcached) would be nice for big jobs. – pguardiario Feb 11 '13 at 02:17

Maybe you should just use a proxy cache, like Squid. It'll be faster, easier, and more reliable than trying to do it yourself.
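If you went that route, the Ruby side barely changes -- you just point the client at the proxy and let Squid decide what to cache and for how long. Roughly (assuming Squid is listening locally on its default port 3128, and a made-up URL):

require 'net/http'

# Net::HTTP::Proxy returns a Net::HTTP subclass that routes requests
# through the given proxy; the proxy handles the caching.
proxy = Net::HTTP::Proxy('localhost', 3128)
response = proxy.get_response(URI('http://example.com/weather/2013-02-10'))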

Tometzky

You could look at Open-URI-Cache or Faraday-HTTP-Cache; the first might be closer to what you need. Neither persists to a database, but perhaps you could write your own storage layer. I have no experience with Heroku, but the file system seems like the right place for this kind of cache.
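For the Faraday option, wiring up the caching middleware might look roughly like this (a sketch -- the require path and the default in-memory store are assumptions, so check the gem's README; the store can be swapped for anything with an ActiveSupport::Cache-style read/write interface, which is where a database-backed layer could plug in):

require 'faraday'
require 'faraday-http-cache'   # assumed require path for the faraday-http-cache gem

client = Faraday.new(:url => 'http://example.com') do |builder|
  builder.use Faraday::HttpCache          # caches according to response headers
  builder.adapter Faraday.default_adapter
end

response = client.get('/weather/2013-02-10')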

dkam