
Background:

I have a Scrapy spider running on Scrapy Cloud, using Crawlera for proxies. The website I am trying to crawl is deep in the sense that each page links to many "next" pages (i.e., pagination), sometimes up to 50 pages deep. I am trying to crawl each and every paginated page.

Problem:

Every now and then, Scrapy raises a [scrapy.core.scraper] Spider error processing <GET https://[URL]?page=2&view=list> (referer: https://[URL]?page=1&view=list). The full traceback is as follows:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
GeneratorExit

The problem I am facing: the pagination forms a chain, because each page only yields the request for the next one (see the code below). So if the first page I visit has 50 paginated follow-up pages and I receive the above error on page #1, pages 2-50 are never requested and I miss all of their data, which is a big problem!

Question:

Is there a way in Scrapy to keep track of the URLs that failed (e.g. <GET https://[URL]?page=2&view=list>) and revisit them at a later stage during the same scraping run? Or, if not, is there a way to tell Scrapy to retry this kind of error a specific number of times?
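
To make the question more concrete, here is a rough sketch of the kind of mechanism I have in mind, pieced together from the signals documentation and answers I have seen elsewhere. The failed_urls list and the handler names are placeholders of my own, the engine.crawl(request, spider) call is only what I have seen used for my Scrapy version, and I have not verified that spider_error actually fires for a GeneratorExit:

import scrapy
from scrapy import Request, signals
from scrapy.exceptions import DontCloseSpider


class MySpider(scrapy.Spider):
    name = 'my_spider'  # placeholder name

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.failed_urls = []
        # Collect URLs whose callback raised, and re-schedule them once the
        # spider would otherwise go idle.
        crawler.signals.connect(spider.on_spider_error, signal=signals.spider_error)
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def on_spider_error(self, failure, response, spider):
        # spider_error is sent when a callback raises while processing a response.
        self.failed_urls.append(response.url)

    def on_idle(self, spider):
        if self.failed_urls:
            for url in self.failed_urls:
                # dont_filter=True so the dupefilter does not drop the retry.
                self.crawler.engine.crawl(
                    Request(url, callback=self.parse, dont_filter=True), spider)
            self.failed_urls = []
            raise DontCloseSpider  # keep the spider alive until the retries finish

    def parse(self, response):
        # the parse() from the "Full code" section below goes here
        ...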

Full code:

def parse(self, response):
    links = response.css('div.title--inline').css('a::attr(href)').extract()
    try:
        pagination = response.css('li.pagination--next').css('a::attr(href)').extract_first()
    except:
        pagination = False
    for link in links:
        yield Request(link, callback=self.parse_details)
    if pagination: # This is where the error originates (if it happens)
        yield response.follow(pagination, self.parse)
Keida
  • Can you please provide the full traceback, or is that all it provides? What happens if you run the spider locally instead of on Scrapy Cloud? – fnet Dec 17 '19 at 00:15
  • That is the full traceback error it provides unfortunately. Running it locally gives the same error occasionally as well I'm afraid (note this error does not happen often / for every page that is paginated - it only appears now and then when scraping 1000s of pages). – Keida Dec 17 '19 at 00:34
  • Perhaps you can try to capture more information about the error by putting a try/except around the yield and capturing it at a higher level using Python logging. The `GeneratorExit` is vague, as it is a catch-all. I'm assuming it is going to be something related to pickling the response, so you could also try reducing your concurrent scrapes and see how it behaves. One more thing you could try, to reduce the impact of pickling, is to send a .copy() of the meta data to the callback. That way it isn't referencing an instance and might force it to behave how we would expect. – fnet Dec 17 '19 at 00:45
  • I will try the try/except block using logging to see what it throws in more detail; I've put a rough sketch of how I understand that suggestion below these comments. In terms of pickling the response - how does the number of concurrent requests the spider is running at affect the possibility of a GeneratorExit? – Keida Dec 17 '19 at 00:49
  • Can you provide a [minimal](https://stackoverflow.com/help/minimal-reproducible-example) but complete spider to reproduce the issue? – Gallaecio Dec 17 '19 at 15:38
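
For reference, this is how I am reading fnet's suggestion before trying it: wrap the yields in a try/except, log whatever surfaces together with the URL, re-raise, and pass a copy of the meta dict to the callbacks. The logger setup is my own, and I am not sure catching BaseException is the right way to actually see a GeneratorExit:

import logging

from scrapy import Request

logger = logging.getLogger(__name__)


def parse(self, response):
    links = response.css('div.title--inline').css('a::attr(href)').extract()
    # As far as I can tell, extract_first() returns None when there is no match,
    # so the original try/except around it should not be needed here.
    pagination = response.css('li.pagination--next').css('a::attr(href)').extract_first()
    try:
        for link in links:
            # Pass a copy of the meta dict so the callback does not keep a
            # reference to this response's meta (fnet's .copy() point).
            yield Request(link, callback=self.parse_details, meta=dict(response.meta))
        if pagination:
            yield response.follow(pagination, self.parse, meta=dict(response.meta))
    except BaseException:
        # GeneratorExit inherits from BaseException, not Exception, so a plain
        # `except Exception` would never see it; log the URL and re-raise.
        logger.exception('parse() failed for %s', response.url)
        raise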

0 Answers