
I'm trying to crawl a large site that has a rate-limiting system in place. Is it possible to pause Scrapy for 10 minutes when it encounters a 403 page? I know I can set a DOWNLOAD_DELAY, but I noticed that I can scrape faster by setting a small DOWNLOAD_DELAY and then pausing Scrapy for a few minutes whenever it gets a 403. That way the rate limiting is triggered only once every hour or so.
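For reference, the fixed-delay approach mentioned above is just a matter of settings; a minimal sketch (the values here are illustrative, not recommendations):

```python
# settings.py
DOWNLOAD_DELAY = 0.25             # small fixed delay between requests (seconds)
RANDOMIZE_DOWNLOAD_DELAY = True   # Scrapy jitters the delay by 0.5x-1.5x
```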

– Crypto

2 Answers


You can write your own retry middleware and put it in your project's middlewares.py:

from time import sleep

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class SleepRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if response.status == 403:
            sleep(120)  # pause for a couple of minutes before retrying
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return super().process_response(request, response, spider)

and don't forget to update settings.py so your middleware replaces the built-in one:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project.middlewares.SleepRetryMiddleware': 100,
}
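A fixed two-minute sleep is one choice; if the site escalates its rate limiting on repeated hits, an exponential backoff schedule may trigger it less often. A minimal sketch of just the delay calculation (pure Python; the function name and constants are illustrative, not part of Scrapy):

```python
def backoff_delay(retry_count, base=120, cap=600):
    """Seconds to wait before the next retry: doubles each attempt, capped."""
    return min(base * (2 ** retry_count), cap)

# first retry waits 120 s, then 240 s, then 480 s, then stays capped at 600 s
```

The retry count is available on the request as `request.meta.get('retry_times', 0)`, so the middleware above could look it up and pass it to such a helper instead of sleeping a constant amount.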
– Danil

Scrapy is a Python framework built on Twisted, so never use time.sleep (or pause.until) inside it: a blocking call freezes the whole reactor and stalls every concurrent request. Instead, use a Deferred from Twisted.

from twisted.internet.defer import Deferred

from scrapy import Request, Spider


class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        # chain the parse step and the pause into one Deferred
        parse_and_pause = Deferred()  # changed
        parse_and_pause.addCallback(self.second_parse_function)  # changed
        parse_and_pause.addCallback(pause, seconds=10)  # changed; `pause` is an external helper, not shown here

        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=parse_and_pause)  # changed

        yield Request('some url', callback=self.non_stop_function)  # call itself again

    def second_parse_function(self, response):
        pass

More info here: Scrapy: non-blocking pause
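The harm done by blocking calls can be illustrated outside Scrapy. This stdlib-asyncio sketch (the `fetch`/`crawl` names are illustrative, not Scrapy APIs) shows the same non-blocking-pause idea: awaiting a sleep yields control back to the event loop so other work keeps running, whereas time.sleep would freeze everything in the process.

```python
import asyncio


async def fetch(url):
    # stand-in for a real download; yields to the event loop instead of blocking it
    await asyncio.sleep(0.01)
    return f"body of {url}"


async def crawl(urls, pause_seconds=0.01):
    results = []
    for url in urls:
        results.append(await fetch(url))
        # non-blocking pause: other coroutines can run during this wait
        await asyncio.sleep(pause_seconds)
    return results


pages = asyncio.run(crawl(["url1", "url2"]))
```

Twisted's Deferred plays the same role in Scrapy that `await` plays here: the wait is registered with the reactor rather than performed on the thread.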

– Aminah Nuraini
    Where does the `pause` come from in the line `parse_and_pause.addCallback(pause, seconds=10)`? Neither this nor the linked question includes the import statements from Twisted, and I can't find a mention of where `pause` is imported from in the docs. – Further Reading Mar 26 '21 at 12:06