
I'm trying to crawl a large site that has a rate-limiting system in place. Is it possible to pause Scrapy for 10 minutes when it encounters a 403 page? I know I can set a DOWNLOAD_DELAY, but I noticed that I can scrape faster by setting a small DOWNLOAD_DELAY and then pausing Scrapy for a few minutes whenever it gets a 403. That way the rate limiting is triggered only once every hour or so.
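For reference, the fixed-delay approach mentioned above is just a matter of settings; a minimal sketch (the values here are illustrative, not recommendations):

```python
# settings.py
DOWNLOAD_DELAY = 0.25             # small fixed delay between requests (seconds)
RANDOMIZE_DOWNLOAD_DELAY = True   # Scrapy jitters the delay by 0.5x-1.5x
```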

– Crypto

2 Answers


You can write your own retry middleware and put it in your project's middlewares.py:

from time import sleep

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class SleepRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if response.status == 403:
            sleep(120)  # pause for a couple of minutes before retrying
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return super().process_response(request, response, spider)

and don't forget to update settings.py so your middleware replaces the built-in one:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project.middlewares.SleepRetryMiddleware': 100,
}
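A fixed two-minute sleep is one choice; if the site escalates its rate limiting on repeated hits, an exponential backoff schedule may trigger it less often. A minimal sketch of just the delay calculation (pure Python; the function name and constants are illustrative, not part of Scrapy):

```python
def backoff_delay(retry_count, base=120, cap=600):
    """Seconds to wait before the next retry: doubles each attempt, capped."""
    return min(base * (2 ** retry_count), cap)

# first retry waits 120 s, then 240 s, then 480 s, then stays capped at 600 s
```

The retry count is available on the request as `request.meta.get('retry_times', 0)`, so the middleware above could look it up and pass it to such a helper instead of sleeping a constant amount.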
– Danil

Scrapy is a Python framework built on Twisted, so never use time.sleep (or pause.until) inside it: a blocking call freezes the whole reactor and stalls every concurrent request. Instead, use a Deferred from Twisted.

from twisted.internet.defer import Deferred

from scrapy import Request, Spider


class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        # chain the parse step and the pause into one Deferred
        parse_and_pause = Deferred()  # changed
        parse_and_pause.addCallback(self.second_parse_function)  # changed
        parse_and_pause.addCallback(pause, seconds=10)  # changed; `pause` is an external helper, not shown here

        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=parse_and_pause)  # changed

        yield Request('some url', callback=self.non_stop_function)  # call itself again

    def second_parse_function(self, response):
        pass

More info here: Scrapy: non-blocking pause
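The harm done by blocking calls can be illustrated outside Scrapy. This stdlib-asyncio sketch (the `fetch`/`crawl` names are illustrative, not Scrapy APIs) shows the same non-blocking-pause idea: awaiting a sleep yields control back to the event loop so other work keeps running, whereas time.sleep would freeze everything in the process.

```python
import asyncio


async def fetch(url):
    # stand-in for a real download; yields to the event loop instead of blocking it
    await asyncio.sleep(0.01)
    return f"body of {url}"


async def crawl(urls, pause_seconds=0.01):
    results = []
    for url in urls:
        results.append(await fetch(url))
        # non-blocking pause: other coroutines can run during this wait
        await asyncio.sleep(pause_seconds)
    return results


pages = asyncio.run(crawl(["url1", "url2"]))
```

Twisted's Deferred plays the same role in Scrapy that `await` plays here: the wait is registered with the reactor rather than performed on the thread.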

– Aminah Nuraini
    Where does the `pause` come from in the line `parse_and_pause.addCallback(pause, seconds=10)`? Neither this nor the linked question includes the import statements from Twisted, and I can't find a mention of where `pause` is imported from in the docs. – Further Reading Mar 26 '21 at 12:06