Background:
I have a Scrapy spider running on Scrapy Cloud using Crawlera for proxies. The website I am trying to crawl is deep in the sense that each page has many "next" pages (i.e., pagination). Sometimes it can be up to 50 pages deep in terms of pagination. I am trying to crawl each and every "paginated" page.
Problem:
Every now and then Scrapy raises a [scrapy.core.scraper] Spider error processing <GET https://[URL]?page=2&view=list> (referer: https://[URL]?page=1&view=list). The full traceback is as follows:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
GeneratorExit
The problem I am facing: say the first page I visit has 50 paginated follow-up pages, and I receive the above error on page #1. Because each next page is only requested from the previous page's callback, the pagination chain breaks and I miss all the data on pages 2-50, which is a big problem!
Question:
Is there a way in Scrapy to keep track of the urls that failed (such as <GET https://[URL]?page=2&view=list>) and revisit them at a later stage during the same scraping run? Or, if not, is there a way to tell Scrapy to retry this error a specific number of times?
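From what I understand, Scrapy's built-in RetryMiddleware already retries requests a configurable number of times, but only for download-level failures (connection errors and the HTTP codes listed in RETRY_HTTP_CODES), not for exceptions raised inside a spider callback like the traceback above. This is what I currently have in settings.py (values are just illustrative), and as far as I can tell it does not cover my case:

# settings.py -- controls the built-in RetryMiddleware.
# Note: this only retries failed downloads, not callback exceptions.
RETRY_ENABLED = True
RETRY_TIMES = 5          # retry each failed request up to 5 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]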
Full code:
from scrapy import Request

def parse(self, response):
    # Links to the detail pages listed on the current page
    links = response.css('div.title--inline').css('a::attr(href)').extract()
    try:
        pagination = response.css('li.pagination--next').css('a::attr(href)').extract_first()
    except Exception:
        pagination = False
    for link in links:
        yield Request(link, callback=self.parse_details)
    if pagination:  # This is where the error originates (if it happens)
        yield response.follow(pagination, self.parse)
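One direction I have been considering (untested, and the class name, meta key, and "myproject" path below are my own placeholders, not Scrapy built-ins) is a spider middleware: process_spider_exception receives the response whose callback raised and may return an iterable of new requests, so it could re-schedule the failed URL a limited number of times. I am not sure whether a GeneratorExit raised from a generator callback actually reaches process_spider_exception on my Scrapy version, so this is only a sketch:

# middlewares.py -- sketch only; RetryCallbackErrorMiddleware and the
# 'callback_retries' meta key are made-up names, not part of Scrapy.
class RetryCallbackErrorMiddleware:
    MAX_RETRIES = 3

    def process_spider_exception(self, response, exception, spider):
        retries = response.meta.get('callback_retries', 0)
        if retries < self.MAX_RETRIES:
            spider.logger.warning(
                'Callback failed for %s (%r), retrying (%d/%d)',
                response.url, exception, retries + 1, self.MAX_RETRIES)
            meta = dict(response.meta, callback_retries=retries + 1)
            # dont_filter=True so the dupefilter does not drop the re-scheduled URL
            return [response.request.replace(meta=meta, dont_filter=True)]
        spider.logger.error('Giving up on %s after %d retries', response.url, retries)
        return []  # returning an iterable marks the exception as handled

# settings.py
# SPIDER_MIDDLEWARES = {
#     'myproject.middlewares.RetryCallbackErrorMiddleware': 10,
# }

Is something along these lines a reasonable approach, or is there a built-in mechanism I am missing?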