
I've implemented a web application that triggers Scrapy spiders using the Scrapyd API (the web app and Scrapyd are running on the same server).

My web application stores the job ids returned by Scrapyd in the DB, and my spiders store the scraped items in the DB.
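For reference, a minimal sketch of how the job id comes back when scheduling a crawl through Scrapyd's schedule.json endpoint (the local URL and the project/spider names are assumptions):

import requests

# Schedule a crawl on a local Scrapyd instance; Scrapyd answers with the job id.
response = requests.post(
    'http://localhost:6800/schedule.json',
    data={'project': 'myproject', 'spider': 'myspider'},
)
job_id = response.json()['jobid']
# the web app stores job_id in its DB here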

The question is: how can I link, in the DB, the job id issued by Scrapyd with the items produced by the crawl?

I could trigger my spider with an extra parameter, let's say an ID generated by my web application, but I'm not sure it is the best solution. In the end, there is no need to create that ID if Scrapyd already issues one...

Thanks for your help

mouch

2 Answers


The question should be phrased as "How can I get the job id of a Scrapyd task at runtime?"

When Scrapyd runs a spider, it actually gives the spider the job id as an argument. It should always be the last argument in sys.argv.

Also, os.environ['SCRAPY_JOB'] should do the trick.
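Putting the two together, a minimal sketch of grabbing the job id inside a spider (the spider name is just a placeholder; '_job' is the keyword argument Scrapyd passes, with the environment variable as a fallback):

import os
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder name

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Scrapyd passes the job id as the '_job' spider argument;
        # SCRAPY_JOB is set in the environment when running under Scrapyd.
        self.job_id = kwargs.get('_job') or os.environ.get('SCRAPY_JOB')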

Rafael Almeida
    Thanks, you are right! I gave it a try writing `logger.debug(kwargs)` in my spider constructor and the scrapyd job id showed up with this key `DEBUG: {'_job': 'd584ea40454911e794246c4008a91422'}` – mouch May 30 '17 at 15:13

In the spider constructor (inside __init__), add this line:

self.jobId = kwargs.get('_job')

then in the parse function, pass it along in the item:

def parse(self, response):
    data = {}
    # ... populate data with the scraped fields ...
    data['jobId'] = self.jobId
    yield data

and in the pipeline, add this:

def process_item(self, item, spider):
    self.jobId = item['jobId']
    # ... store the item together with its job id ...
    return item
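For completeness, a sketch of what such a pipeline could do with the job id, assuming a throwaway SQLite table (the DB file, table, and column names are made up for illustration; swap in whatever DB the web app uses):

import sqlite3

class JobIdPipeline:
    def open_spider(self, spider):
        # hypothetical SQLite store, just to make the linking concrete
        self.conn = sqlite3.connect('items.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS items (job_id TEXT, data TEXT)')

    def process_item(self, item, spider):
        # persist the item together with the job id set in parse()
        self.conn.execute(
            'INSERT INTO items (job_id, data) VALUES (?, ?)',
            (item['jobId'], str(item)),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()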
Sadia