
I need to share one common object instance among the crawlers / spiders running on scrapyd. Ideally, I would hook the object's methods up to each spider's signals, something like:

from scrapy import signals

ext = CommonObject()
crawler.signals.connect(ext.onSpiderOpen, signal=signals.spider_opened)
crawler.signals.connect(ext.onSpiderClose, signal=signals.spider_closed)

and so on,

where CommonObject would be instantiated and initialized only once and would expose its methods to all running crawling processes / spiders (I don't mind using a singleton for this purpose).
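
For illustration, this is roughly the shape of the object I have in mind (the class and its attributes are just placeholders; the only real requirement is that there is exactly one instance of it):

class CommonObject(object):
    """Shared state that every spider should see; instantiated exactly once."""

    def __init__(self):
        self.open_spiders = set()

    def onSpiderOpen(self, spider):
        self.open_spiders.add(spider.name)

    def onSpiderClose(self, spider, reason):
        self.open_spiders.discard(spider.name)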

Based on my research I understand I have two options:

  1. Run all spiders / crawlers within one CrawlerProcess, in which the CommonObject would also be instantiated (rough sketch after this list).
  2. Run one spider / crawler per CrawlerProcess (default scrapy(d) behavior), instantiate the CommonObject somewhere in the reactor and perhaps access it remotely using twisted.spread.pb.
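
To make the first option more concrete, here is a minimal sketch of what I mean, assuming a project with two spiders registered as 'spider_a' and 'spider_b' (made-up names), and assuming process.create_crawler() can be used to get at each crawler's signals before the crawl is scheduled:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

ext = CommonObject()                             # the single shared instance
process = CrawlerProcess(get_project_settings())

for name in ('spider_a', 'spider_b'):            # made-up spider names
    crawler = process.create_crawler(name)       # build the crawler first to reach its signals
    crawler.signals.connect(ext.onSpiderOpen, signal=signals.spider_opened)
    crawler.signals.connect(ext.onSpiderClose, signal=signals.spider_closed)
    process.crawl(crawler)

process.start()                                  # blocks until all crawls have finished

If I understand correctly, further process.crawl(...) calls could also be made later from inside the running reactor (e.g. from a spider_closed handler), which is exactly the run-time scheduling part of question 1 below that I am unsure about.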

Questions:

  1. Are there any CPU utilization penalties (i.e. is the CPU utilized less effectively) with the first option compared to letting scrapyd manage the processes (the second option)? Is CrawlerProcess capable of running multiple crawlers in parallel (not sequentially)? And how would you schedule further spiders at run-time within the same CrawlerProcess? (I understand CrawlerProcess.start() is blocking.)
  2. I am not advanced enough to implement the second option (I am not even sure it is a viable option at all; my very rough guess at it is at the end of this post). Is there anybody who could sketch a sample implementation?
  3. Perhaps I am missing something and there is another way of doing this?
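
For reference, this is the rough shape I imagine for the second option. It is only a guess pieced together from the twisted.spread.pb documentation; the port number and method names are made up, and I have no idea whether this plays nicely with scrapyd. The shared object would live in its own long-running process:

from twisted.internet import reactor
from twisted.spread import pb

class CommonObject(pb.Root):
    """Lives in a standalone process; spiders call into it over PB."""

    def remote_onSpiderOpen(self, spider_name):
        pass  # update the shared state here

    def remote_onSpiderClose(self, spider_name):
        pass

reactor.listenTCP(8789, pb.PBServerFactory(CommonObject()))
reactor.run()

and each spider process launched by scrapyd would forward its signals to it from an extension (enabled via the EXTENSIONS setting), something like:

from scrapy import signals
from twisted.internet import reactor
from twisted.spread import pb

class CommonObjectProxy(object):
    """Hypothetical extension that forwards spider signals over PB."""

    def __init__(self):
        self.factory = pb.PBClientFactory()
        reactor.connectTCP('localhost', 8789, self.factory)

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.onSpiderOpen, signal=signals.spider_opened)
        crawler.signals.connect(ext.onSpiderClose, signal=signals.spider_closed)
        return ext

    def onSpiderOpen(self, spider):
        d = self.factory.getRootObject()
        d.addCallback(lambda root: root.callRemote('onSpiderOpen', spider.name))

    def onSpiderClose(self, spider, reason):
        d = self.factory.getRootObject()
        d.addCallback(lambda root: root.callRemote('onSpiderClose', spider.name))

Again, this is only my guess at what the second option could look like; I am not sure it is even the right direction.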
