I have been struggling with this for quite some time and have not been able to solve it. The problem is that I have a start_urls list of a few hundred URLs, but only part of these URLs are consumed by the start_requests() of my spider.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    # SETTINGS
    name = 'example'
    allowed_domains = []
    start_urls = []

    # set rules for links to follow
    link_follow_extractor = LinkExtractor(allow=allowed_domains, unique=True)
    rules = (Rule(link_follow_extractor, callback='parse',
                  process_request='process_request', follow=True),)

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # urls to scrape
        self.start_urls = ['https://example1.com', 'https://example2.com']
        self.allowed_domains = ['example1.com', 'example2.com']

    def start_requests(self):
        # create initial requests for urls in start_urls
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse, priority=1000,
                                 meta={'priority': 100, 'start': True})

    def parse(self, response):
        print("parse")
I have read multiple posts on Stack Overflow about this issue, and some threads on GitHub (going all the way back to 2015), but haven't been able to get it to work.
To my understanding, the problem is that while I am still creating my initial requests, the first requests have already produced responses, which are parsed and generate new requests that fill up the queue. I confirmed that this is my problem: when I use a middleware to limit the number of pages downloaded per domain to 2, the issue seems to be resolved. That would make sense, as the first created requests would then only generate a few new requests, leaving room in the queue for the remainder of the start_urls list.
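For reference, the per-domain limit I used for that test behaves roughly like the downloader middleware sketched below (class name, setting name, and counting logic are illustrative, not my exact code); it is enabled through the usual DOWNLOADER_MIDDLEWARES setting:

from collections import defaultdict
from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest


class DomainPageLimitMiddleware:
    """Illustrative downloader middleware: drop requests for a domain once its page limit is hit."""

    def __init__(self, max_pages_per_domain=2):
        self.max_pages = max_pages_per_domain
        self.page_counts = defaultdict(int)

    @classmethod
    def from_crawler(cls, crawler):
        # 'MAX_PAGES_PER_DOMAIN' is a custom setting name used only in this sketch
        return cls(crawler.settings.getint('MAX_PAGES_PER_DOMAIN', 2))

    def process_request(self, request, spider):
        # count requests per domain and ignore anything beyond the limit
        domain = urlparse(request.url).netloc
        if self.page_counts[domain] >= self.max_pages:
            raise IgnoreRequest(f'page limit reached for {domain}')
        self.page_counts[domain] += 1
        return None  # let the request continue through the downloader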
I also noticed that when I reduce the number of concurrent requests from 32 to 2, an even smaller part of the start_urls list is consumed. Increasing the number of concurrent requests to a few hundred is not an option, as that leads to timeouts.
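To be explicit, the concurrency change I refer to is just the standard Scrapy setting, e.g. as a spider-level override (values shown here match what I described above):

class MySpider(CrawlSpider):
    custom_settings = {
        'CONCURRENT_REQUESTS': 2,  # was 32; lowering it left even fewer start_urls consumed
    }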
It is still unclear to me why the spider shows this behavior and doesn't simply continue consuming the start_urls. I would greatly appreciate it if someone could give me some pointers towards a potential solution.