
I have the following task: the DB contains ~2k URLs, and for each URL we need to run a spider until all of them have been processed. I have been running the spider for a bunch of URLs at a time (10 per run).

I have used the following code:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

URLs = crawler_table.find(crawl_timestamp=None)
settings = get_project_settings()
for i in range(len(URLs) // 10):
    process = CrawlerProcess(settings)

    limit = 10
    kount = 0

    for crawl in crawler_table.find(crawl_timestamp=None):
        if kount < limit:
            kount += 1
            process.crawl(
                MySpider,
                start_urls=[crawl['crawl_url']]
            )
    process = CrawlerProcess(settings)
    process.start()

but it only runs for the first iteration of the loop; on the second one I get this error:

  File "C:\Program Files\Python310\lib\site-packages\scrapy\crawler.py", line 327, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1314, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1296, in startRunning
    ReactorBase.startRunning(cast(ReactorBase, self))
  File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 840, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Is there any way to avoid this error and run the spider for all 2k URLs?

Roman

1 Answer


This is because you can't start the Twisted reactor twice in the same process. You can use multiprocessing and launch each batch in a separate process. Your code may look like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import multiprocessing as mp

# MySpider is assumed to be importable from your Scrapy project.

def start_crawlers(url_batch, limit=10):
    # Runs inside its own process, so the Twisted reactor is started
    # exactly once here and ReactorNotRestartable cannot occur.
    settings = get_project_settings()
    process = CrawlerProcess(settings)

    count = 0
    for url in url_batch:
        if count < limit:
            count += 1
            process.crawl(
                MySpider,
                start_urls=[url],
            )
    process.start()  # blocks until every spider in this batch has finished

if __name__ == "__main__":
    URLs = ...  # an iterable of URL batches (e.g. lists of 10 URLs each)
    for url_batch in URLs:
        process = mp.Process(target=start_crawlers, args=(url_batch,))
        process.start()
        process.join()  # wait for this batch before launching the next one
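
For completeness, here is one way the URLs = ... placeholder could be filled in. This is only a minimal sketch that assumes crawler_table is the same DB handle as in the question and that start_crawlers above is available; chunked is a small hypothetical helper, not part of Scrapy:

import multiprocessing as mp
from itertools import islice

def chunked(iterable, size=10):
    # Yield successive lists of at most `size` items from `iterable`.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

if __name__ == "__main__":
    # Hypothetical: same crawler_table lookup as in the question.
    pending = [row['crawl_url'] for row in crawler_table.find(crawl_timestamp=None)]
    for url_batch in chunked(pending, 10):
        p = mp.Process(target=start_crawlers, args=(url_batch,))
        p.start()
        p.join()  # finish this batch of 10 before launching the next

Because each batch runs in a fresh process that is joined before the next one starts, only one reactor (and at most 10 spiders) is alive at any time.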
zaki98
  • But in this case we will start a process for all 2k URLs = 2k spiders simultaneously. I need a limit on the number of spiders running simultaneously (e.g. 10). – Roman Mar 12 '23 at 12:45
  • Sorry, I misunderstood you then. I'll update the answer. – zaki98 Mar 12 '23 at 13:22
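
Regarding the comment about capping how many spiders run at once: a rough sketch (not the promised answer update) of one way to do it with a fixed-size process pool, assuming the start_crawlers and chunked helpers above and the crawler_table from the question. Setting maxtasksperchild=1 gives every batch a fresh worker process, so the reactor is never started twice in the same process:

import multiprocessing as mp

if __name__ == "__main__":
    pending = [row['crawl_url'] for row in crawler_table.find(crawl_timestamp=None)]
    batches = list(chunked(pending, 10))
    # At most 2 batches (so at most ~20 spiders) run at any one time;
    # each worker handles a single batch and is then replaced.
    with mp.Pool(processes=2, maxtasksperchild=1) as pool:
        pool.map(start_crawlers, batches)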