
I've implemented a web application that triggers Scrapy spiders using the Scrapyd API (the web app and Scrapyd are running on the same server).

My web application stores the job ids returned by Scrapyd in the DB, and my spiders store the scraped items in the DB.
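For reference, a minimal sketch of how the job id comes back when scheduling a crawl through Scrapyd's schedule.json endpoint (the local URL and the project/spider names are assumptions):

import requests

# Schedule a crawl on a local Scrapyd instance; Scrapyd answers with the job id.
response = requests.post(
    'http://localhost:6800/schedule.json',
    data={'project': 'myproject', 'spider': 'myspider'},
)
job_id = response.json()['jobid']
# the web app stores job_id in its DB here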

The question is: how can I link, in the DB, the job id issued by Scrapyd with the items produced by the crawl?

I could trigger my spider with an extra parameter, let's say an ID generated by my web application, but I'm not sure it is the best solution. In the end, there is no need to create that ID if Scrapyd already issues one...

Thanks for your help

mouch

2 Answers


The question should be phrased as "How can I get the job id of a Scrapyd task at runtime?"

When Scrapyd runs a spider, it actually gives the spider the job id as an argument. It should always be the last argument in sys.argv.

Also, os.environ['SCRAPY_JOB'] should do the trick.
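Putting the two together, a minimal sketch of grabbing the job id inside a spider (the spider name is just a placeholder; '_job' is the keyword argument Scrapyd passes, with the environment variable as a fallback):

import os
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder name

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Scrapyd passes the job id as the '_job' spider argument;
        # SCRAPY_JOB is set in the environment when running under Scrapyd.
        self.job_id = kwargs.get('_job') or os.environ.get('SCRAPY_JOB')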

Rafael Almeida
    Thanks, you are right! I gave it a try writing `logger.debug(kwargs)` in my spider constructor and the scrapyd job id showed up with this key `DEBUG: {'_job': 'd584ea40454911e794246c4008a91422'}` – mouch May 30 '17 at 15:13

In the spider constructor (inside __init__), add this line:

self.jobId = kwargs.get('_job')

then in the parse function, pass it along in the item:

def parse(self, response):
    data = {}
    # ... populate data with the scraped fields ...
    data['jobId'] = self.jobId
    yield data

and in the pipeline, add this:

def process_item(self, item, spider):
    self.jobId = item['jobId']
    # ... store the item together with its job id ...
    return item
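For completeness, a sketch of what such a pipeline could do with the job id, assuming a throwaway SQLite table (the DB file, table, and column names are made up for illustration; swap in whatever DB the web app uses):

import sqlite3

class JobIdPipeline:
    def open_spider(self, spider):
        # hypothetical SQLite store, just to make the linking concrete
        self.conn = sqlite3.connect('items.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS items (job_id TEXT, data TEXT)')

    def process_item(self, item, spider):
        # persist the item together with the job id set in parse()
        self.conn.execute(
            'INSERT INTO items (job_id, data) VALUES (?, ?)',
            (item['jobId'], str(item)),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()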
Sadia