
I get a 503 status when I crawl. The request is retried, but once the retries run out it is ignored. I want it to be marked as an error instead. How can I do that?

I would prefer to set this in settings.py so that it applies to all of my spiders. handle_httpstatus_list seems to affect only one spider.

Aminah Nuraini

2 Answers


There are two settings that you should look into:

RETRY_HTTP_CODES:

Default: [500, 502, 503, 504, 408]

Which HTTP response codes to retry. Other errors (DNS lookup issues, connections lost, etc) are always retried.

https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#retry-http-codes

And HTTPERROR_ALLOWED_CODES:

Default: []

Pass all responses with non-200 status codes contained in this list.

https://doc.scrapy.org/en/latest/topics/spider-middleware.html#std:setting-HTTPERROR_ALLOWED_CODES
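
As a rough sketch, both settings could be set project-wide in settings.py like this. The values are only an illustration: the retry list is the default quoted above, and allowing 503 is an assumption about what you want to reach your callbacks. You would still have to check response.status in the callback and log the error yourself.

```python
# settings.py (project-wide, so it applies to every spider)

# Status codes that the retry middleware retries before giving up.
# These are the defaults quoted above; adjust as needed.
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]

# Non-200 status codes that are passed on to spider callbacks
# instead of being filtered out. Allowing 503 here is an assumption
# made for this example.
HTTPERROR_ALLOWED_CODES = [503]
```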

Granitosaurus

In the end, I overrode the retry middleware to make one small change: whenever the scraper gives up retrying a request, whatever the status code, it is logged as an error.

It seems Scrapy doesn't treat giving up on retries as an error, which seems strange to me.

Here is the middleware, if anyone wants to use it. Don't forget to activate it in settings.py.

import logging

from scrapy.downloadermiddlewares.retry import RetryMiddleware

logger = logging.getLogger(__name__)


class Retry500Middleware(RetryMiddleware):

    def _retry(self, request, reason, spider):
        # Number of failed attempts so far, counting the one that just failed.
        retries = request.meta.get('retry_times', 0) + 1

        if retries <= self.max_retry_times:
            logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request, 'retries': retries, 'reason': reason},
                         extra={'spider': spider})
            # Reschedule a copy of the request, bypassing the duplicate filter.
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        else:
            # This is the only change: it used to be `logger.debug` instead of `logger.error`.
            logger.error("Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request, 'retries': retries, 'reason': reason},
                         extra={'spider': spider})
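
To activate it, something along these lines should work in settings.py. This is a sketch, not the exact configuration from my project: the module path `myproject.middlewares` is a placeholder for wherever you saved the class, and 550 is the slot the built-in retry middleware normally occupies.

```python
# settings.py

DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in retry middleware so requests are not retried twice.
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    # Hypothetical module path: replace with the location of your class.
    'myproject.middlewares.Retry500Middleware': 550,
}
```

Setting the built-in entry to None disables it, so only the subclass handles retries.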
Aminah Nuraini