
What is the advised course of action when Scrapy fails with the following exception:

OSError: [Errno 28] No space left on device

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 53, in process_response
    spider=spider)
  File "/usr/lib/python3.6/site-packages/scrapy/downloadermiddlewares/httpcache.py", line 86, in process_response
    self._cache_response(spider, response, request, cachedresponse)
  File "/usr/lib/python3.6/site-packages/scrapy/downloadermiddlewares/httpcache.py", line 106, in _cache_response
    self.storage.store_response(spider, request, response)
  File "/usr/lib/python3.6/site-packages/scrapy/extensions/httpcache.py", line 317, in store_response
    f.write(to_bytes(repr(metadata)))
OSError: [Errno 28] No space left on device

In this specific case, a ramdisk/tmpfs limited to 128 MB was used as the cache disk, with HTTPCACHE_EXPIRATION_SECS = 300 and httpcache.FilesystemCacheStorage as the cache backend:

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 300
HTTPCACHE_DIR = '/tmp/ramdisk/scrapycache' # (tmpfs on /tmp/ramdisk type tmpfs (rw,relatime,size=131072k))
HTTPCACHE_IGNORE_HTTP_CODES = ['400','401','403','404','500','504']
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

I might be wrong, but I get the impression that Scrapy's FilesystemCacheStorage might not be managing its cache (with respect to storage limits) all that well.

Might it be better to use LevelDB?

DhP
  • Seems 5 minutes is too short a time for anything to get old enough to drop out of the cache - 128MB is fairly small - could you not just increase it to a more realistic level? – Jon Clements Mar 31 '18 at 14:54
  • @JonClements Increasing the size of the ramdisk somewhat is doable, but I feel it would be nice to handle the error regardless (without simply wiping the cache completely when full, and/or writing a script that keeps watch over the disk). The default cache expiration is 30 seconds, I believe. – DhP Mar 31 '18 at 14:59
  • You could extend the middleware to handle it, but it's probably simpler to either 1) increase the space, 2) decrease the cache time or 3) increase the delay between requests so you're filling the cache more slowly. – Jon Clements Mar 31 '18 at 15:05
  • @JonClements By "too short a time", do you mean Scrapy updates access times? If so, I think that might be the problem. – DhP Mar 31 '18 at 15:06
  • @JonClements timing is 2 concurrent request per domain with 10 second delay – DhP Mar 31 '18 at 15:08
  • I also can't remember if scrapy actively will clear up its cache or whether it will just keep it forever and just make new requests when the expiration has passed and update the cache again. (ie - it'll grow indefinitely) – Jon Clements Mar 31 '18 at 15:09
  • @JonClements That's the impression I get, that it just grows over time, with not much effective cleanup. In my opinion, it should be evicting the oldest cached data when the disk is full. – DhP Mar 31 '18 at 15:10
  • Yeah... can't see anything in https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/httpcache.py - so looks like you'll have to implement your own TTL stuff... (or store it in something like redis which you can set TTL live on - however, if you're trying to just use 128mb on a tmpfs - that may be far from suitable) – Jon Clements Mar 31 '18 at 15:15
  • @JonClements Yeah. After the failure, there were 34604 files left in the cache, which to me does not really add up for 2 connections per domain with a 10 second delay and a 5 minute expiration time. All of them appear to be HTML text, averaging 25KB (binary files are not scraped/downloaded). – DhP Mar 31 '18 at 15:31
  • @JonClements I think you are correct. From what I can glean from the code, it only decides not to read an entry from the cache (after the expiration time), simply skipping it, without anything actually deleting it. At least not there. – DhP Mar 31 '18 at 16:01
  • Seems like LevelDB caching fares no better when it comes to cache management. – DhP Apr 07 '18 at 17:35

1 Answer


You are right. Nothing is deleted after the cache expires. The HTTPCACHE_EXPIRATION_SECS setting only decides whether to use a cached response or re-download it, and that holds for every HTTPCACHE_STORAGE backend.

If your cache data is very large, you should consider using a database instead of the local filesystem. Alternatively, you can extend the backend storage with a LoopingCall task that continuously deletes expired cache entries.
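
For the LoopingCall route, a minimal sketch of such a subclass might look like the one below. It assumes the on-disk layout used by FilesystemCacheStorage (one directory per cached request under <cachedir>/<spider.name>/) and uses each entry directory's modification time as an approximation of when it was stored; the class name, the purge interval and the mtime-based age check are illustrative choices, not something Scrapy ships with.

import os
import shutil
import time

from scrapy.extensions.httpcache import FilesystemCacheStorage
from twisted.internet.task import LoopingCall


class ExpiringFilesystemCacheStorage(FilesystemCacheStorage):
    """Filesystem cache storage that also purges expired entries on a timer."""

    PURGE_INTERVAL = 60  # seconds between purge passes (arbitrary choice)

    def open_spider(self, spider):
        super().open_spider(spider)
        # Run the purge periodically on the Twisted reactor.
        self._purge_task = LoopingCall(self._purge_expired, spider)
        self._purge_task.start(self.PURGE_INTERVAL, now=False)

    def close_spider(self, spider):
        if getattr(self, '_purge_task', None) and self._purge_task.running:
            self._purge_task.stop()
        super().close_spider(spider)

    def _purge_expired(self, spider):
        # expiration_secs == 0 means "cache forever", so there is nothing to purge.
        if self.expiration_secs <= 0:
            return
        spider_cachedir = os.path.join(self.cachedir, spider.name)
        if not os.path.isdir(spider_cachedir):
            return
        now = time.time()
        for prefix in os.listdir(spider_cachedir):
            prefix_dir = os.path.join(spider_cachedir, prefix)
            if not os.path.isdir(prefix_dir):
                continue
            for fingerprint in os.listdir(prefix_dir):
                entry_dir = os.path.join(prefix_dir, fingerprint)
                try:
                    # Directory mtime is a rough proxy for when the entry was written.
                    age = now - os.stat(entry_dir).st_mtime
                except OSError:
                    continue  # entry vanished or is unreadable; skip it
                if age > self.expiration_secs:
                    shutil.rmtree(entry_dir, ignore_errors=True)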

Why does Scrapy keep data around that it is only going to ignore?

I think there are two points:

  • HTTPCACHE_EXPIRATION_SECS only controls whether to use a cached response or re-download; it merely guarantees that you never use an expired cache entry. Different spiders may set different expiration times, so deleting entries based on one spider's expiration could leave the cache in an inconsistent state for another.

  • Deleting expired cache entries would require a LoopingCall task that continuously checks for expired entries, which would make the extension more complex than Scrapy wants its built-in extensions to be.
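
If you do go down that road, enabling a subclass like the sketch above is just a matter of pointing HTTPCACHE_STORAGE at it; the module path here is a placeholder for wherever you define it.

HTTPCACHE_STORAGE = 'myproject.httpcache.ExpiringFilesystemCacheStorage'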

Alex