Scrapy how to save a State between spider runs (via scrapinghub)?

Question

I have a spider that will run on schedule. Spider input is based on Date. From date of last scrape to todays date. So the question is how to save the date of last scrape within the Scrapy project? There is an option to get data from scrapy settings using pkjutil module, but i did not find any reference in the docs on how to write data in that file. Any idea? Maybe an alternative? P.S. My other option is to use some free remote MySql DB just for this. But looks like more work if simple solution is available.

import pkgutil

class CodeSpider(scrapy.Spider):
name = "code"
allowed_domains = ["google.com.au"]

def start_requests(self):
    f = pkgutil.get_data("au_go", "res/state.json")
    ids = json.loads(f)
    id = ids[0]['state']

    yield {'state':id}
    ids[0]['state'] = 'New State'
    with open('./au_go/res/state.json', 'w') as f:
        json.dump(ids, f)

The above solution works fine when ran locally. But I am getting no such file or directory when running the code at Scrapinghub.

File "/tmp/unpacked-eggs/__main__.egg/au_go/spiders/test_state.py", line 33, in parse
    with open(savePath, 'w') as f:
IOError: [Errno 2] No such file or directory: './au_go/res/state.json'

You are doing it correct, just save in a plain text file,,, that is it — Umair Ayub, Nov 25 '17 at 16:52

score 1 · Accepted Answer · answered Nov 26 '17 at 14:48

The problem is fixed with use of Scrapinghub Colections

And scrapinghub API. Works nice now. Here is an example code in case somebody will find it usefull.

from scrapinghub import ScrapinghubClient


client = ScrapinghubClient(Your API KEY)
project = client.get_project(Your Project ID)
collections = project.collections

last_accessed = collections.get_store('last_accessed')
last_accessed.set({'_key': 'Date', 'value': '12-54-1235'})
print last_accessed.get('Date')['value']

Scrapy how to save a State between spider runs (via scrapinghub)?

1 Answers1