12

I'm trying to run a scraper of which the output log ends as follows:

2017-04-25 20:22:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 http://www.apkmirror.com/apk/instagram/instagram-instagram/instagram-instagram-9-0-0-34920-release/instagram-9-0-0-4-android-apk-download/>: HTTP status code is not handled or not allowed
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-25 20:22:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16048410,
 'downloader/request_count': 32902,
 'downloader/request_method_count/GET': 32902,
 'downloader/response_bytes': 117633316,
 'downloader/response_count': 32902,
 'downloader/response_status_count/200': 121,
 'downloader/response_status_count/429': 32781,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 25, 18, 22, 22, 710446),
 'log_count/DEBUG': 32903,
 'log_count/INFO': 32815,
 'request_depth_max': 2,
 'response_received_count': 32902,
 'scheduler/dequeued': 32902,
 'scheduler/dequeued/memory': 32902,
 'scheduler/enqueued': 32902,
 'scheduler/enqueued/memory': 32902,
 'start_time': datetime.datetime(2017, 4, 25, 17, 54, 36, 621481)}
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Spider closed (finished)

In short, of the 32,902 requests, only 121 are successful (response code 200) whereas the remainder receives 429 for 'too many requests' (cf. https://httpstatuses.com/429).

Are there any recommended ways to get around this? To start with, I'd like to have a look at the details of the 429 response rather than just ignoring it, as it may contain a Retry-After header indicating how long to wait before making a new request.
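To make that concrete, here is roughly what I have in mind for inspecting the 429 (just a sketch, not working code: the spider name and URL are placeholders). The idea is to let the status code through via `handle_httpstatus_list` so the response reaches the callback, then read the Retry-After header there.

# Sketch only: let 429 responses through to the spider callback so the
# Retry-After header can be inspected. Spider name and URL are placeholders.
import scrapy

class ApkMirrorSpider(scrapy.Spider):
    name = 'apkmirror'
    handle_httpstatus_list = [429]  # don't let HttpErrorMiddleware drop 429s
    start_urls = ['http://www.apkmirror.com/']

    def parse(self, response):
        if response.status == 429:
            # May contain a number of seconds or an HTTP date, or be absent entirely.
            retry_after = response.headers.get('Retry-After')
            self.logger.info('Got 429; Retry-After: %r', retry_after)
            return
        # ... normal parsing of successful responses ...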

Also, if the requests are made using Privoxy and Tor as described in http://blog.michaelyin.info/2014/02/19/scrapy-socket-proxy/, it may be possible to implement retry middleware which makes Tor change its IP address when this occurs. Are there any public examples of such code?
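For the Tor part, I imagine something along these lines could work (a rough, untested sketch: it assumes Tor's ControlPort is enabled on 9051 and the stem library is installed; the class name and password are placeholders):

# Rough, untested sketch: on a 429, ask Tor for a new circuit before retrying.
# Assumes Tor's ControlPort is enabled on 9051 and 'stem' is installed;
# the control-port password is a placeholder.
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
from stem import Signal
from stem.control import Controller

class TorRetryMiddleware(RetryMiddleware):

    def _new_tor_identity(self):
        # NEWNYM asks Tor to build new circuits, which usually changes the exit IP.
        with Controller.from_port(port=9051) as controller:
            controller.authenticate(password='my_password')  # placeholder
            controller.signal(Signal.NEWNYM)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status == 429:
            self._new_tor_identity()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return super().process_response(request, response, spider)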

Kurt Peek
  • That response is coming from the target website you are scraping; it's because you are accessing them too much with the same IP. The only solution I see is to use rotating proxies from StormProxies, Crawlera or ProxyMesh etc – Umair Ayub Apr 26 '17 at 11:31

5 Answers

26

You can modify the retry middleware to pause when it gets a 429 error. Put the code below in middlewares.py:

    from scrapy.downloadermiddlewares.retry import RetryMiddleware
    from scrapy.utils.response import response_status_message
    
    import time
    
    class TooManyRequestsRetryMiddleware(RetryMiddleware):
    
        def __init__(self, crawler):
            super(TooManyRequestsRetryMiddleware, self).__init__(crawler.settings)
            self.crawler = crawler
    
        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)
    
        def process_response(self, request, response, spider):
            if request.meta.get('dont_retry', False):
                return response
            elif response.status == 429:
                self.crawler.engine.pause()
                time.sleep(60) # If the rate limit is renewed in a minute, put 60 seconds, and so on.
                self.crawler.engine.unpause()
                reason = response_status_message(response.status)
                return self._retry(request, reason, spider) or response
            elif response.status in self.retry_http_codes:
                reason = response_status_message(response.status)
                return self._retry(request, reason, spider) or response
            return response 

Add 429 to the retry codes in settings.py:

RETRY_HTTP_CODES = [429]
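Note that this replaces Scrapy's default retry codes rather than adding to them, so if you still want retries on the usual transient server errors, something like the following is probably closer to what you want (the defaults here are taken from the Scrapy docs and are worth double-checking for your version):

RETRY_HTTP_CODES = [429, 500, 502, 503, 504, 522, 524, 408]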

Then activate it in settings.py (don't forget to disable the default retry middleware):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'flat.middlewares.TooManyRequestsRetryMiddleware': 543,
}
Alon Barad
Aminah Nuraini
  • Is it necessary to add `RETRY_HTTP_CODES = [429]` ? Doesn't your middleware have a separate case condition to handle 429, such that it doesn't need to be in the set `retry_http_codes`? thanks. – awaage Sep 20 '18 at 16:59
  • It won't get retried if it's not included in retry_http_codes. Yup, I can also handle 429 using another form of middleware, but I personally prefer to retry it a few times first before marking it as a total error. – Aminah Nuraini Sep 21 '18 at 11:04
  • What does the `543` value do? – Arthur Julião Nov 19 '18 at 22:59
  • That's priority value – Aminah Nuraini Nov 20 '18 at 12:00
  • @AminahNuraini: I am trying to invoke TooManyRequestsRetryMiddleware from my spider class in Test.py: `class TestSpider(scrapy.Spider): name = 'Test'; custom_settings = {'DOWNLOADER_MIDDLEWARES': {'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 'myproject.middlewares.TooManyRequestsRetryMiddleware': 543}}`. TooManyRequestsRetryMiddleware is in middlewares.py, but it gives the following error: `ModuleNotFoundError: No module named 'myproject'`. Any suggestion? – Vidyadhar Sep 17 '19 at 04:20
  • change myproject to your own project's name – Aminah Nuraini Sep 17 '19 at 08:11
  • That is my project name. – Vidyadhar Sep 17 '19 at 15:17
  • Doing a `time.sleep(60)` inside `process_response` in a middleware is not a good idea. It will block Scrapy from doing anything else during those 60 seconds, so other requests will time out and everything will get really slow. – Done Data Solutions Nov 23 '20 at 13:45
  • @DoneDataSolutions What do you recommend doing? – CountDOOKU Jul 07 '22 at 09:54
  • Instead of the hard `time.sleep(60)`? Play along with scrapy's architecture. In settings.py set `DOWNLOAD_DELAY` to high values, set `CONCURRENT_REQUEST...` to low values and make sure scrapy's builtin retry middleware is active and 429 is in the list of `RETRY_HTTP_CODES` – Done Data Solutions Jul 07 '22 at 18:09
  • @AminahNuraini How do you 'activate' it? on ```settings.py``` – CountDOOKU Jul 08 '22 at 02:03
  • @DoneDataSolutions If we don't block requests to that domain, and if we use the default retry middleware with 429 in `RETRY_HTTP_CODES`, won't the spider keep pinging the same domain (for other requests)? Won't this prolong the temporary block in place? Isn't there a way to pause requests to that particular domain while the temporary block is lifted (a worst-case condition after setting polite concurrent requests and autothrottle settings). – Parth Kapadia Dec 14 '22 at 10:55
  • @ParthKapadia if you want to stop doing any requests to a domain that blocks you, you'll likely need your own implementation for this. Ideally as a middleware. If you just want to go slower on such sites, it might help to also add the autothrottle extension as this will try slowing down requests to unresponsive domains. – Done Data Solutions Dec 15 '22 at 11:51
11

Wow, your scraper is going really fast: over 30,000 requests in under 30 minutes, which works out to nearly 20 requests per second.

Such a high volume will trigger rate limiting on bigger sites and will completely bring down smaller sites. Don't do that.

This might even be too fast for Privoxy and Tor, so they may also be the source of some of those 429 replies.

Solutions:

  1. Start slow. Reduce the concurrency settings and increase DOWNLOAD_DELAY so you do at most 1 request per second, then increase these values step by step and see what happens. It might sound paradoxical, but you may actually get more items and more 200 responses by going slower (see the settings sketch after this list).

  2. If you are scraping a big site, try rotating proxies. The Tor network can be a bit heavy-handed for this in my experience, so you might try a proxy service like the ones Umair suggests.
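For example, a conservative starting point in settings.py could look something like this (the exact values are only a suggestion to start experimenting from):

# Roughly one request per second; tune upwards step by step.
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1
# Optionally let Scrapy adapt the delay to how the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1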

Done Data Solutions
2

Building upon Aminah Nuraini's answer, you can use Twisted's Deferreds to avoid breaking asynchrony the way a time.sleep() call does:

from twisted.internet import reactor, defer
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

async def async_sleep(delay, return_value=None):
    deferred = defer.Deferred()
    reactor.callLater(delay, deferred.callback, return_value)
    return await deferred

class TooManyRequestsRetryMiddleware(RetryMiddleware):
    """
    Modifies RetryMiddleware to delay retries on status 429.
    """

    DEFAULT_DELAY = 60  # Delay in seconds.
    MAX_DELAY = 600  # Sometimes, RETRY-AFTER has absurd values

    async def process_response(self, request, response, spider):
        """
        Like RetryMiddleware.process_response, but, if response status is 429,
        retry the request only after waiting at most self.MAX_DELAY seconds.
        Respect the Retry-After header if it's less than self.MAX_DELAY.
        If Retry-After is absent/invalid, wait only self.DEFAULT_DELAY seconds.
        """

        if request.meta.get('dont_retry', False):
            return response

        if response.status in self.retry_http_codes:
            if response.status == 429:
                retry_after = response.headers.get('retry-after')
                try:
                    retry_after = int(retry_after)
                except (ValueError, TypeError):
                    delay = self.DEFAULT_DELAY
                else:
                    delay = min(self.MAX_DELAY, retry_after)
                spider.logger.info(f'Retrying {request} in {delay} seconds.')

                spider.crawler.engine.pause()
                await async_sleep(delay)
                spider.crawler.engine.unpause()

            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response

        return response

The line await async_sleep(delay) suspends process_response's execution until delay seconds have passed, but Scrapy is free to do other things in the meantime. This async/await coroutine syntax was introduced in Python 3.5, and support for it was added in Scrapy 2.0.

It's still necessary to modify settings.py as in the original answer.
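For completeness, the corresponding settings.py entries would look something like this (the module path `myproject.middlewares` is an assumption; adjust it to wherever the class above lives):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    # Assumed path; point this at the module containing TooManyRequestsRetryMiddleware.
    'myproject.middlewares.TooManyRequestsRetryMiddleware': 543,
}
RETRY_HTTP_CODES = [429]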

Ivan Lonel
  • I'm getting an `AttributeError: 'NoneType' object has no attribute 'meta'` error after calling `_retry()`. I guess it's because the `deferred` returns `None`, but I'm not sure what to do, maybe pass `self._retry(request, reason, spider)` instead, like the original? – daneeq Oct 27 '21 at 16:12
  • @daneeq You are right, the first argument should have been `request` all along. Thanks for the heads up. I've updated the code to what I use today, which I believe is a bit more clear than the previous version. – Ivan Lonel Oct 28 '21 at 02:50
0

You can use HTTPERROR_ALLOWED_CODES = [404, 429]. I was getting a 429 HTTP code, and once I allowed it the problem was fixed. You can allow whatever HTTP code you are getting in the terminal. This may solve your problem.

David Buck
mel
  • Adding it to the allowed HTTP statuses will not solve the problem of accessing the content on these pages; with a 429, no content is delivered, and he needs the content for sure. – Amr Alaa Jun 25 '20 at 18:39
0

Here is a simple trick I found:

import scrapy
import time    ## just add this line

BASE_URL = 'your any url'

class EthSpider(scrapy.Spider):
    name = 'eth'
    start_urls = [
        BASE_URL.format(1)
    ]
    pageNum = 2

    def parse(self, response):
        data = response.json()

        for i in range(len(data['data']['list'])):
            yield data['data']['list'][i]

        next_page = 'next page url'

        time.sleep(0.2)      # and add this line

        if EthSpider.pageNum <= data['data']['page']:
            EthSpider.pageNum += 1
            yield response.follow(next_page, callback=self.parse)
Josef