
I have started using Scrapy to scrape a few websites. If I later add a new field to my model or change my parsing functions, I'd like to be able to "replay" the downloaded raw data offline to scrape it again. It looks like Scrapy had the ability to store raw data in a replay file at one point:

http://dev.scrapy.org/browser/scrapy/trunk/scrapy/command/commands/replay.py?rev=168

But this functionality seems to have been removed in the current version of Scrapy. Is there another way to achieve this?

del
    did you try to ask at the ML? It feels unfair to me if I ask your question there and just paste the answer :P – naeg Oct 18 '11 at 09:34
    If you have a solution to my problem, that's fine by me - just reference your source ;) – del Oct 22 '11 at 14:04

2 Answers


If you run `crawl --record=[cache.file] [scraper]`, you'll then be able to use `replay [scraper]`.

Alternatively, you can cache all responses with the HttpCacheMiddleware by including it in DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 300,
}

If you do this, every time you run the scraper, it will check the file system first.
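For completeness, here is a minimal `settings.py` sketch that enables Scrapy's HTTP cache alongside the middleware (the `HTTPCACHE_*` names are standard Scrapy settings; the directory name is just an example):

```python
# settings.py (sketch) -- enable Scrapy's on-disk HTTP cache so that
# downloaded responses are stored and replayed on subsequent runs.
HTTPCACHE_ENABLED = True          # turn the cache on
HTTPCACHE_DIR = 'httpcache'       # stored under the project's .scrapy directory
HTTPCACHE_EXPIRATION_SECS = 0     # 0 = cached responses never expire

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 300,
}
```

With `HTTPCACHE_EXPIRATION_SECS = 0`, a re-run of the spider will be served entirely from the cache directory, so you can re-parse old raw data offline.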

Tim McNamara
  • I tried `scrapy crawl --record=mycache myspider` and got the error message "crawl: error: no such option: --record". I am using Scrapy 0.12.0.2548. Using HttpCacheMiddleware won't work since I will make multiple identical requests over time which will return different responses. – del Oct 22 '11 at 14:01

You can enable HTTPCACHE_ENABLED as described at http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html?highlight=FilesystemCacheStorage#httpcache-enabled to cache all HTTP requests and responses, which lets you resume a crawl from the cached data.

Alternatively, try the Jobs feature to pause and resume crawling: http://scrapy.readthedocs.org/en/latest/topics/jobs.html
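As a quick usage sketch (the spider name and job directory here are placeholders), the Jobs feature is driven by the `JOBDIR` setting on the command line:

```shell
# Start a crawl whose state (request queue, seen-requests filter) is
# persisted to disk in the given directory.
scrapy crawl somespider -s JOBDIR=crawls/somespider-1

# Stop it gracefully with a single Ctrl-C, then resume later by running
# the same command with the same JOBDIR.
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
```

Note this persists crawl *state* so an interrupted run can continue; it does not keep a replayable archive of raw responses the way the HTTP cache does.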

fxp
  • This won't work if I want to make identical requests over time which will return different responses. For instance, what if I want to scrape the slashdot.org home page every hour? I can't replay this, since the cached entry will just be overwritten every hour. – del Sep 12 '12 at 07:47