
I'm running a long-running web crawl using scrapyd and scrapy 1.0.3 on an Amazon EC2 instance. I'm exporting jsonlines files to S3 using these parameters in my spider/settings.py file:

FEED_FORMAT = 'jsonlines'
FEED_URI = 's3://my-bucket-name'

My scrapyd.conf file sets the items_dir property to empty:

items_dir=

The items_dir property is set to empty so that scrapyd does not override the FEED_URI setting in the spider's settings, which points to an S3 bucket (see Saving items from Scrapyd to Amazon S3 using Feed Exporter).

This works as expected in most cases, but I'm running into a problem on one particularly large crawl: the local disk (which isn't particularly big) fills up with the in-progress crawl's data before the crawl can complete, and thus before the results can be uploaded to S3.

I'm wondering if there is any way to configure where the "intermediate" results of this crawl are written prior to being uploaded to S3. I'm assuming that however Scrapy internally represents the in-progress crawl data, it is not held entirely in RAM but buffered on disk somewhere; if that's the case, I'd like to point that location at an external mount with enough space to hold the results before the completed .jl file is shipped to S3. Specifying a value for items_dir is not an option, since that prevents scrapyd from automatically uploading the results to S3 on completion.

bds914
  • Why not write the values to a JSON file as each item is scraped? I wrote a [blog post](http://kirankoduru.github.io/python/sqlalchemy-pipeline-scrapy.html) about adding each item to a DB as it is scraped, but you could probably adapt it to write to a file instead. –  Feb 16 '16 at 03:53

1 Answer


The S3 feed storage option inherits from BlockingFeedStorage, whose open() method returns a TemporaryFile(prefix='feed-') from the tempfile module.

The default directory for that temporary file is chosen from a platform-dependent list; tempfile also consults the TMPDIR, TEMP and TMP environment variables first, so setting TMPDIR for the scrapyd process is one way to relocate it.

Alternatively, you can subclass S3FeedStorage and override its open() method to return a temporary file created somewhere other than the default, for example via the dir argument of tempfile.TemporaryFile([mode='w+b'[, bufsize=-1[, suffix=''[, prefix='tmp'[, dir=None]]]]])
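Here is a minimal sketch of that approach (untested; the module path myproject.feedstorage, the class name, and the mount point /mnt/bigdisk are placeholders for your own project layout and volume):

```python
# myproject/feedstorage.py  (hypothetical module path)
from tempfile import TemporaryFile

from scrapy.extensions.feedexport import S3FeedStorage


class BigDiskS3FeedStorage(S3FeedStorage):
    """S3 feed storage that buffers the in-progress feed on a large mount."""

    def open(self, spider):
        # Same as BlockingFeedStorage.open(), except the temporary file is
        # created under /mnt/bigdisk instead of the platform default.
        return TemporaryFile(prefix='feed-', dir='/mnt/bigdisk')
```

Then point the s3 URI scheme at the subclass in settings.py via the FEED_STORAGES setting:

```python
# settings.py
FEED_STORAGES = {
    's3': 'myproject.feedstorage.BigDiskS3FeedStorage',
}
```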

paul trmbrth
  • There's an open issue about being able to customize the location: https://github.com/scrapy/scrapy/issues/1779 – paul trmbrth Feb 26 '16 at 14:27
  • Issue is now [fixed in "master"](https://github.com/scrapy/scrapy/pull/1847) branch of Scrapy. Will be available in Scrapy 1.1 – paul trmbrth Apr 09 '16 at 16:09