
I want to set the HTTPCACHE_DIR setting to the value which the user provides through the custom arguments.

  • Just to be clear: you don't want to use `scrapy crawl myspider -s HTTPCACHE_DIR="..."` (which sets the value automatically), but rather `... -a something=abc`, and then construct the full directory inside the spider? – malberts Feb 18 '19 at 13:04
  • Yes, I want it the way you described – Jigar Chavada Feb 18 '19 at 13:09
  • I have deleted my answer and will instead refer you to this: https://github.com/scrapy/scrapy/issues/2392#issuecomment-259661978 It's perhaps better to approach the problem another way. Make a wrapper script that starts the execution, for instance: https://stackoverflow.com/a/42512653/2781701 – Rafael Almeida Feb 18 '19 at 15:11

1 Answer


By default, Scrapy reads the HTTPCACHE_DIR setting in FilesystemCacheStorage, the default storage backend of HttpCacheMiddleware:

class FilesystemCacheStorage(object):

    def __init__(self, settings):
        self.cachedir = data_path(settings['HTTPCACHE_DIR'])
        self.expiration_secs = settings.getint('HTTPCACHE_EXPIRATION_SECS')
        self.use_gzip = settings.getbool('HTTPCACHE_GZIP')
        self._open = gzip.open if self.use_gzip else open

As you can see, Scrapy reads the HTTPCACHE_DIR setting only once, when it creates the FilesystemCacheStorage object. Even if you somehow change the HTTPCACHE_DIR setting later, cachedir will not change.
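The read-once behaviour can be illustrated without Scrapy at all. The sketch below uses a plain dict and a hypothetical `FakeStorage` class as stand-ins (neither is a Scrapy API) to show why changing the setting after construction has no effect, while assigning to the attribute does:

```python
class FakeStorage:
    """Stand-in for a cache storage that copies a setting at construction time."""

    def __init__(self, settings):
        # The value is copied once; later changes to `settings` are not seen.
        self.cachedir = settings["HTTPCACHE_DIR"]


settings = {"HTTPCACHE_DIR": "httpcache"}
storage = FakeStorage(settings)

# Changing the setting after construction does not affect the storage.
settings["HTTPCACHE_DIR"] = "other_dir"
unchanged = storage.cachedir

# Assigning to the attribute directly does.
storage.cachedir = "other_dir"
changed = storage.cachedir
```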
Once the storage object exists, the only way to change cachedir during the scraping process is to set the cachedir attribute of the FilesystemCacheStorage object. You can do this in your spider code
(for scrapy crawl myspider -a HTTPCACHE_DIR="cache_dir"):

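import scrapy
from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
from scrapy.utils.project import data_path

class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        # Spider arguments passed with -a become instance attributes;
        # getattr avoids an AttributeError when the argument is omitted
        cache_dir = getattr(self, "HTTPCACHE_DIR", None)
        if cache_dir:
            # Select downloader middlewares
            downloader_middlewares = self.crawler.engine.downloader.middleware.middlewares
            # Select HttpCacheMiddleware (only present when HTTPCACHE_ENABLED is True)
            for middleware in downloader_middlewares:
                if isinstance(middleware, HttpCacheMiddleware):
                    # Change cachedir
                    middleware.storage.cachedir = data_path(cache_dir)
        # start_requests must still return the requests to crawl
        for url in self.start_urls:
            yield scrapy.Request(url)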
Georgiy