
I have two spiders that I want to execute in parallel. I used a CrawlerProcess instance and its crawl method to achieve this. However, I want to specify a different output file, i.e. FEED_URI, for each spider in the same process. I tried to loop over the spiders and run them as shown below. Although two different output files are generated, the process terminates as soon as the second spider completes execution. If the first spider finishes crawling before the second one, I get the desired output. However, if the second spider finishes crawling first, it doesn't wait for the first spider to complete. How can I fix this?

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spider_loader.list():
    setting['FEED_FORMAT'] = 'json'
    setting['LOG_LEVEL'] = 'INFO'
    setting['FEED_URI'] = spider_name+'.json'
    setting['LOG_FILE'] = spider_name+'.log'
    process = CrawlerProcess(setting)
    print("Running spider %s" % spider_name)
    process.crawl(spider_name)

process.start()
print("Completed")
Mithil Mohan
  • You are overwriting the `process` object inside the loop. Please look at the thread https://stackoverflow.com/questions/39706005/crawlerprocess-vs-crawlerrunner – Tarun Lalwani Jun 18 '20 at 05:08
  • How else could I write the output to two separate Json Files? – Mithil Mohan Jun 18 '20 at 05:15
  • See this article https://kirankoduru.github.io/python/multiple-scrapy-spiders.html – Tarun Lalwani Jun 18 '20 at 05:20
  • I couldn't get this to work with my code; however, if I only do setting.update({ 'FEED_FORMAT': 'json', 'FEED_URI': spider_name + ".json", 'LOG_FILE': spider_name + '.log', 'LOG_LEVEL': 'INFO' }) inside the loop, it works fine and generates the appropriate JSON output. But the log files aren't written correctly. Is there a way to make the log files correct too? – Mithil Mohan Jun 18 '20 at 06:40

1 Answer


According to the Scrapy docs, running multiple spiders in the same process with a single CrawlerProcess should look like this:

import scrapy
from scrapy.crawler import CrawlerProcess

class Spider1(scrapy.Spider):
    ...

class Spider2(scrapy.Spider):
    ...

process = CrawlerProcess()
process.crawl(Spider1)
process.crawl(Spider2)
process.start()
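
Applied to the loop from the question, that means creating the CrawlerProcess once, scheduling every spider on it, and calling start() a single time. A minimal sketch, keeping the project settings and spider_loader usage from the question:

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
settings['FEED_FORMAT'] = 'json'
settings['LOG_LEVEL'] = 'INFO'

# One process for all spiders: crawl() only schedules a spider,
# start() blocks until every scheduled spider has finished.
process = CrawlerProcess(settings)
for spider_name in process.spider_loader.list():
    print("Running spider %s" % spider_name)
    process.crawl(spider_name)

process.start()
print("Completed")

Per-spider values such as FEED_URI can't be set this way, because all crawls here share the same project settings object; that is what custom_settings (below) is for.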

Settings on a per-spider basis can be applied using the custom_settings spider attribute.
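
For example, each spider could declare its own feed in custom_settings. A sketch with a hypothetical spider, using the same FEED_URI/FEED_FORMAT keys as in the question:

import scrapy

class QuotesSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = 'quotes'
    # These values are merged into the project settings for this spider's
    # crawler only, so each spider gets its own output file.
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': 'quotes.json',
    }

    def parse(self, response):
        ...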

However, Scrapy has a group of components whose settings can't be applied on a per-spider basis (only per CrawlerProcess).

Components that use logging, SpiderLoader, and Twisted-reactor-related settings are already initialized before Scrapy reads a spider's custom_settings.

When you call scrapy crawl ... from the command-line tool, you in fact create a single CrawlerProcess for the single spider given in the command arguments.

"the process terminates as soon as the second spider completes execution"

If these are the same spider versions you previously launched with scrapy crawl ..., this is not expected behavior.

Georgiy
  • How exactly could I modify the above code to create individual logs too? – Mithil Mohan Jun 18 '20 at 08:32
  • Logs can't be individual in this case. In this configuration there is a single logger object per CrawlerProcess (due to specifics of the logging implementation in the Scrapy source code). The only reliable ways to do it are: 1. using the `scrapy crawl ...` command-line tool separately for each spider, as you previously did; 2. creating 2 separate scripts and splitting execution into 2 crawler processes (which is nearly the same) – see the sketch below. – Georgiy Jun 18 '20 at 13:40
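
A minimal sketch of option 2, with hypothetical spider and file names: each spider gets its own script and therefore its own CrawlerProcess, so LOG_FILE and FEED_URI are configured independently per run.

# run_spider_one.py -- one script and one CrawlerProcess per spider,
# so each run gets its own log file and feed. Duplicate the script
# (or pass the spider name via sys.argv) for the second spider.
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
settings['LOG_LEVEL'] = 'INFO'
settings['LOG_FILE'] = 'spider_one.log'    # per-process log file
settings['FEED_FORMAT'] = 'json'
settings['FEED_URI'] = 'spider_one.json'   # per-process feed

process = CrawlerProcess(settings)
process.crawl('spider_one')  # hypothetical spider name
process.start()

The two scripts can still run in parallel by launching them as separate OS processes, e.g. from a shell or via subprocess.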