
I have the following task: the DB contains ~2k URLs, and for each URL we need to run a spider until all of them have been processed. I have been running the spider for a bunch of URLs at a time (10 per run).

I have used the following code:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

URLs = crawler_table.find(crawl_timestamp=None)
settings = get_project_settings()
for i in range(len(URLs) // 10):
    process = CrawlerProcess(settings)

    limit = 10
    kount = 0

    for crawl in crawler_table.find(crawl_timestamp=None):
        if kount < limit:
            kount += 1
            process.crawl(
                MySpider,
                start_urls=[crawl['crawl_url']]
            )
    process = CrawlerProcess(settings)
    process.start()

but it only runs for the first iteration of the loop; on the second one I get this error:

  File "C:\Program Files\Python310\lib\site-packages\scrapy\crawler.py", line 327, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1314, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1296, in startRunning
    ReactorBase.startRunning(cast(ReactorBase, self))
  File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 840, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Is there any way to avoid this error and run the spider for all 2k URLs?

Roman

1 Answer


This is because you can't start the Twisted reactor twice in the same process. You can use multiprocessing and launch each batch in a separate process. Your code may look like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import multiprocessing as mp

# MySpider is assumed to be importable from your Scrapy project.

def start_crawlers(url_batch, limit=10):
    # Runs inside its own process, so the Twisted reactor is started
    # exactly once here and ReactorNotRestartable cannot occur.
    settings = get_project_settings()
    process = CrawlerProcess(settings)

    count = 0
    for url in url_batch:
        if count < limit:
            count += 1
            process.crawl(
                MySpider,
                start_urls=[url],
            )
    process.start()  # blocks until every spider in this batch has finished

if __name__ == "__main__":
    URLs = ...  # an iterable of URL batches (e.g. lists of 10 URLs each)
    for url_batch in URLs:
        process = mp.Process(target=start_crawlers, args=(url_batch,))
        process.start()
        process.join()  # wait for this batch before launching the next one
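
For completeness, here is one way the URLs = ... placeholder could be filled in. This is only a minimal sketch that assumes crawler_table is the same DB handle as in the question and that start_crawlers above is available; chunked is a small hypothetical helper, not part of Scrapy:

import multiprocessing as mp
from itertools import islice

def chunked(iterable, size=10):
    # Yield successive lists of at most `size` items from `iterable`.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

if __name__ == "__main__":
    # Hypothetical: same crawler_table lookup as in the question.
    pending = [row['crawl_url'] for row in crawler_table.find(crawl_timestamp=None)]
    for url_batch in chunked(pending, 10):
        p = mp.Process(target=start_crawlers, args=(url_batch,))
        p.start()
        p.join()  # finish this batch of 10 before launching the next

Because each batch runs in a fresh process that is joined before the next one starts, only one reactor (and at most 10 spiders) is alive at any time.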
zaki98
  • But in this case we will start a process for all 2k URLs = 2k spiders simultaneously. I need a limit on the number of spiders running simultaneously (e.g. 10). – Roman Mar 12 '23 at 12:45
  • Sorry, I misunderstood you then. I'll update the answer. – zaki98 Mar 12 '23 at 13:22
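
Regarding the comment about capping how many spiders run at once: a rough sketch (not the promised answer update) of one way to do it with a fixed-size process pool, assuming the start_crawlers and chunked helpers above and the crawler_table from the question. Setting maxtasksperchild=1 gives every batch a fresh worker process, so the reactor is never started twice in the same process:

import multiprocessing as mp

if __name__ == "__main__":
    pending = [row['crawl_url'] for row in crawler_table.find(crawl_timestamp=None)]
    batches = list(chunked(pending, 10))
    # At most 2 batches (so at most ~20 spiders) run at any one time;
    # each worker handles a single batch and is then replaced.
    with mp.Pool(processes=2, maxtasksperchild=1) as pool:
        pool.map(start_crawlers, batches)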