I am using Scrapy to build a broad crawler that will crawl a few thousand pages across 50-60 different domains. I sometimes encounter a 429 (Too Many Requests) status code and am thinking about how to deal with it. I am already setting polite policies for concurrent requests per domain and AutoThrottle, so a 429 is the worst-case situation.
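For context, these are the kinds of polite settings I mean; the values below are illustrative, not recommendations:

```python
# settings.py - illustrative polite-crawl values for a broad crawl
CONCURRENT_REQUESTS_PER_DOMAIN = 2     # cap on parallel requests per domain
DOWNLOAD_DELAY = 1.0                   # base delay between requests
AUTOTHROTTLE_ENABLED = True            # adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for ~1 request in flight per domain
```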
By default, Scrapy drops the request.
If we add 429 to RETRY_HTTP_CODES, Scrapy uses the default retry middleware, which reschedules the request at the end of the queue. Meanwhile, other requests to the same domain keep pinging the server. Does this prolong the temporary block imposed by rate limiting? If not, why not use this approach alone instead of the more complex solutions described below?
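That is, something along these lines in settings.py (the list below is the stock set of retryable codes in my version, plus 429):

```python
# settings.py - opt 429 into the stock RetryMiddleware
RETRY_ENABLED = True
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
RETRY_TIMES = 2  # per-request retry cap (see the EDIT below)
```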
Another approach is to pause the whole spider when it encounters a 429. However, one of the comments mentions that this leads to timeouts in other in-flight requests, and it would also block requests to all domains, which is inefficient since requests to the other domains should continue normally. Does it make sense to temporarily reschedule requests to the offending domain instead of continuously pinging its server with further requests? If so, how can it be implemented in Scrapy?
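Here is a rough, untested sketch of what I have in mind: a subclass of the stock RetryMiddleware that, on a 429, raises the delay on that domain's download slot so queued requests to it back off. It pokes at Scrapy's internal downloader slots, and the class name and 60-second back-off are my own placeholders:

```python
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class TooManyRequestsRetryMiddleware(RetryMiddleware):
    """On a 429, back off the offending domain's download slot
    instead of pausing the whole crawl."""

    def __init__(self, crawler):
        super().__init__(crawler.settings)
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status == 429:
            # Raise this slot's delay so queued requests to the same
            # domain wait instead of hammering the server. Downloader
            # slots are internal API and may change between versions.
            slot_key = request.meta.get('download_slot')
            slot = self.crawler.engine.downloader.slots.get(slot_key)
            if slot is not None:
                slot.delay = max(slot.delay, 60)  # arbitrary back-off
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        # Fall back to the stock behaviour for other retryable codes.
        return super().process_response(request, response, spider)
```

It would be registered in DOWNLOADER_MIDDLEWARES in place of scrapy.downloadermiddlewares.retry.RetryMiddleware, under whatever module path the project uses.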
Would that approach solve the issue? Specifically:
- When rate limiting has already been triggered, does sending further requests (which will receive 429 responses) prolong the period for which rate limiting is applied, or does it have no effect on that period?
- How can Scrapy be paused from sending requests to a particular domain while continuing its other tasks (including requests to other domains)?
EDIT:
The default retry middleware cannot be used as-is because it has a maximum retry counter, RETRY_TIMES. Once that counter is exhausted for a particular request, the request is dropped, which is exactly what we don't want in the case of a 429.
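A possible workaround I'm considering: bypass _retry() for 429s and recycle a copy of the request manually, so the counter never increments. This untested variant would replace process_response in the sketch above (priority_adjust comes from the stock middleware's RETRY_PRIORITY_ADJUST):

```python
    def process_response(self, request, response, spider):
        if response.status == 429 and not request.meta.get('dont_retry', False):
            # (the per-slot back-off from the earlier sketch would go here)
            # Re-enqueue a copy without touching retry_times in meta,
            # so RETRY_TIMES can never cause the request to be dropped.
            retryreq = request.copy()
            retryreq.dont_filter = True  # skip the duplicate filter
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        return super().process_response(request, response, spider)
```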