
I have written a Scrapy scraper that writes data out using the JsonItemExporter, and I have worked out how to export this data to my AWS S3 bucket using the following spider settings in ScrapingHub:

AWS_ACCESS_KEY_ID = AAAAAAAAAAAAAAAAAAAA
AWS_SECRET_ACCESS_KEY = Abababababababababababababababababababab
FEED_FORMAT = json
FEED_URI = s3://scraper-dexi/my-folder/jobs-001.json

What I need to do is dynamically set the date/time in the output filename. I would love it to use a date and time format like jobs-20171215-1000.json, but I don't know how to set a dynamic FEED_URI with ScrapingHub.

There is not much information online, and the only example I can find is here on the ScrapingHub site, but unfortunately it does not work.

When I apply these settings based on the example in the documentation:

AWS_ACCESS_KEY_ID = AAAAAAAAAAAAAAAAAAAA
AWS_SECRET_ACCESS_KEY = Abababababababababababababababababababab
FEED_FORMAT = json
FEED_URI = s3://scraper-dexi/my-folder/jobs-%(time).json

Note the %(time) in my URI.

The scraping fails with the following errors:

[scrapy.utils.signal] Error caught on signal handler: <bound method ?.open_spider of <scrapy.extensions.feedexport.FeedExporter object at 0x7fd11625d410>>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 190, in open_spider
    uri = self.urifmt % self._get_uri_params(spider)
ValueError: unsupported format character 'j' (0x6a) at index 53

[scrapy.utils.signal] Error caught on signal handler: <bound method ?.item_scraped of <scrapy.extensions.feedexport.FeedExporter object at 0x7fd11625d410>>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 220, in item_scraped
    slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
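
For context, the first traceback shows that Scrapy expands FEED_URI with plain Python %-style string formatting (uri = self.urifmt % ...). Without the trailing s, the character after %(time) is read as a conversion type, and j is not a valid one; the second error is just a knock-on effect of open_spider failing. A minimal sketch reproducing this outside Scrapy (the timestamp value is illustrative):

# Scrapy expands FEED_URI with plain %-formatting, so the same
# ValueError can be reproduced directly:
params = {"time": "2017-12-15T10-00-00"}

broken = "s3://scraper-dexi/my-folder/jobs-%(time).json"
try:
    broken % params
except ValueError as exc:
    print(exc)  # unsupported format character 'j' (0x6a) at index ...

# With the trailing 's' the placeholder is complete and expands cleanly:
fixed = "s3://scraper-dexi/my-folder/jobs-%(time)s.json"
print(fixed % params)  # s3://scraper-dexi/my-folder/jobs-2017-12-15T10-00-00.json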

1 Answer


I misunderstood the importance of the s in the documentation and did not realize that it was part of the format placeholder.

I altered

FEED_URI = s3://scraper-dexi/my-folder/jobs-%(time).json

to

FEED_URI = s3://scraper-dexi/my-folder/jobs-%(time)s.json

as per the documentation, which solved the problem:

%(time)

changed to

%(time)s
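
Note that %(time)s expands to Scrapy's own timestamp (something like 2017-12-15T10-00-00 in the versions I have used), not the compact jobs-20171215-1000.json layout mentioned in the question. If you need that exact layout, one possible approach (a sketch, not taken from the answer; the spider name and URL are hypothetical) is to build the URI yourself in the spider's custom_settings:

from datetime import datetime

import scrapy


class JobsSpider(scrapy.Spider):
    # Hypothetical spider; only the custom_settings part matters here.
    name = "jobs"
    start_urls = ["https://example.com/jobs"]

    # The URI is computed once, when the class body is evaluated,
    # so each run writes to its own timestamped file.
    custom_settings = {
        "FEED_FORMAT": "json",
        "FEED_URI": "s3://scraper-dexi/my-folder/jobs-%s.json"
                    % datetime.utcnow().strftime("%Y%m%d-%H%M"),
    }

    def parse(self, response):
        yield {"url": response.url}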
