
I'm new to Scrapy and I need to pause a spider after it receives an error response (like 407 or 429).
Also, I want to do this without using time.sleep(), using middlewares or extensions instead.

Here is my middleware:

from scrapy import signals
from pydispatch import dispatcher

class Handle429:
    def __init__(self):
        dispatcher.connect(self.item_scraped, signal=signals.item_scraped)

    def item_scraped(self, item, spider, response):
        if response.status == 429:
            print("THIS IS 429 RESPONSE")
            #
            # here stop spider for 10 minutes and then continue
            #

I read about self.crawler.engine.pause(), but how can I implement it in my middleware and set a custom pause duration?
Or is there another way to do this? Thanks.


1 Answer

I have solved my problem. First of all, a downloader middleware can define default methods like process_response or process_request, which Scrapy calls for every response/request.

In settings.py

HTTPERROR_ALLOWED_CODES = [404]  # allow 404 responses to reach the spider instead of being filtered out
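
For the middleware to be picked up at all, it also has to be enabled in settings.py. A minimal sketch; the module path and priority below are placeholders for your own project layout:

# settings.py -- adjust the path to wherever your middleware class lives
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.HandleErrorResponse': 543,
}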

Then I changed my middleware class:

from twisted.internet import reactor
from twisted.internet.defer import Deferred

# replaces class Handle429
class HandleErrorResponse:

    def __init__(self):
        self.time_pause = 1800  # pause duration in seconds (30 minutes)

    def process_response(self, request, response, spider):
        # called for every response before it reaches the spider;
        # it must return a Response, a Request, or a Deferred
        return response

Then I found code that helps me pause the spider without time.sleep():

# in HandleErrorResponse
def process_response(self, request, response, spider):
    print(response.status)
    if response.status == 404:
        # return an unfired Deferred; Scrapy waits on it, and
        # reactor.callLater() fires it with the response after time_pause seconds
        d = Deferred()
        reactor.callLater(self.time_pause, d.callback, response)
        return d

    return response

And it works.
I can't explain every detail of reactor.callLater(), but it doesn't stop Scrapy's event loop; it schedules d.callback(response) to run after time_pause seconds. Because process_response returns the Deferred before it has fired, Scrapy waits on it and only passes the response to the spider once the timer fires, while the reactor keeps serving everything else.
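
As a side note, since the question mentioned self.crawler.engine.pause(): pausing the whole engine instead of a single response should also be possible. This is an untested sketch of that idea, not what I ended up using; engine.pause() and engine.unpause() do exist on the Scrapy engine, and spider.crawler is available on every spider:

# in a downloader middleware -- untested sketch of the engine-pause idea
def process_response(self, request, response, spider):
    if response.status in (407, 429):
        # freeze the whole engine, then schedule the unpause without blocking
        spider.crawler.engine.pause()
        reactor.callLater(self.time_pause, spider.crawler.engine.unpause)
    return response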
