I have the following task: in the DB we have ~2k URLs, and for each URL we need to run a spider until all of them have been processed. I was running the spider in batches of URLs (10 per run), using the following code:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

URLs = crawler_table.find(crawl_timestamp=None)
settings = get_project_settings()

for i in range(len(URLs) // 10):
    process = CrawlerProcess(settings)
    limit = 10
    kount = 0
    for crawl in crawler_table.find(crawl_timestamp=None):
        if kount < limit:
            kount += 1
            process.crawl(
                MySpider,
                start_urls=[crawl['crawl_url']]
            )
    process.start()  # works on the first pass, raises on the second
It runs only for the first iteration of the outer loop; on the second iteration I get this error:
File "C:\Program Files\Python310\lib\site-packages\scrapy\crawler.py", line 327, in start
reactor.run(installSignalHandlers=False) # blocking call
File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1314, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1296, in startRunning
ReactorBase.startRunning(cast(ReactorBase, self))
File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 840, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Is there any way to avoid this error and run the spider for all 2k URLs?
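The only workaround I can think of is to queue every pending crawl on a single CrawlerProcess and call start() just once, so the reactor never has to restart. A rough, untested sketch of that idea, reusing crawler_table and MySpider from above:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
process = CrawlerProcess(settings)

# crawl() only schedules a crawler; nothing runs until start() is called,
# so every pending URL can be queued before the reactor starts.
for crawl in crawler_table.find(crawl_timestamp=None):
    process.crawl(MySpider, start_urls=[crawl['crawl_url']])

process.start()  # starts the reactor once, blocks until all queued crawls finish

I don't know whether queuing ~2k crawlers at once is practical, though. Would it be better to pass all the URLs to a single spider as one start_urls list, or to run each batch of 10 in a fresh subprocess (e.g. with multiprocessing) so that every batch gets its own reactor?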