
I am writing a crawler using Scrapy (Python) and don't know how to handle certain errors.

The website I am crawling sometimes returns an empty body or a normal-looking page containing an error message. Both replies come with a standard 200 HTTP status code.

What I want to do when I encounter such a situation is tell Scrapy to

  • not save the response to cache (I am using HTTPCACHE_ENABLED = True), since the content of a successful request looks different
  • reschedule the request
  • reduce the request rate (I am using AUTOTHROTTLE_ENABLED = True)

Is there an easy way, like raising a certain exception à la raise scrapy.TemporaryError, or do I have to do everything manually? In the latter case, how do I delete content from the cache or talk to the autothrottle module?

I know I can set dont_cache on requests to not cache them. But usually I do want to cache my requests and only decide on the response whether I want to keep it. Also, the documentation is not clear on whether this flag only avoids saving the response to cache or whether it also avoids reading the request from cache...
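For illustration, what I imagine is something along the lines of a custom cache policy. Scrapy lets you plug one in via HTTPCACHE_POLICY, and the policy's should_cache_response hook is consulted before a response is stored. A rough sketch (SkipBadResponsesPolicy is a name I made up, and the empty-body/error-marker checks are placeholders for my actual condition):

from scrapy.extensions.httpcache import DummyPolicy

class SkipBadResponsesPolicy(DummyPolicy):
    # Refuses to store 'bad' 200 responses in the HTTP cache.
    # Hypothetical; enable with something like
    # HTTPCACHE_POLICY = "myproject.policies.SkipBadResponsesPolicy"

    def should_cache_response(self, response, request):
        if not response.body:
            # Empty body: do not store
            return False
        if b"error" in response.body.lower():
            # Error-page marker: placeholder for the real site-specific check
            return False
        return super().should_cache_response(response, request)

This would only cover the storing side, though; it would not stop a bad response that was cached earlier from being served back.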

Autothrottle uses the download latency to adjust the request rate. The throttling algorithm treats non-200 responses as failed responses and does not decrease the download delay. However, my bad responses come back with a 200 status code, so autothrottle cannot handle the situation. There must be a way to tell autothrottle to apply its throttling logic and treat these specific requests as failed.
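For reference, this is roughly the adjustment logic I mean, paraphrased from scrapy/extensions/throttle.py (the exact code may differ between Scrapy versions):

def _adjust_delay(self, slot, latency, response):
    # Aim for latency / target_concurrency seconds between requests
    target_delay = latency / self.target_concurrency
    new_delay = max(target_delay, (slot.delay + target_delay) / 2.0)
    new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
    # A non-200 response never *decreases* the delay -- but my bad
    # responses come back as 200, so this guard never triggers
    if response.status != 200 and new_delay <= slot.delay:
        return
    slot.delay = new_delay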

C. Yduqoli

1 Answer


In your callback you can check for the condition and decide to re-queue the URL (see also: requests disappear after queueing in scrapy):

from scrapy import Request

def parse(self, response):
    # blank_data / should_rescrape stand for your own detection logic
    if blank_data or should_rescrape:
        yield Request(response.url, dont_filter=True, callback=self.parse)

Adjusting throttle dynamically

If you check self.crawler.extensions.middlewares, you will see that it holds all loaded extensions.

In my case, self.crawler.extensions.middlewares[5] gives <scrapy.extensions.throttle.AutoThrottle object at 0x10b75a208>. (Of course, the index varies; you should loop through the tuple and find which entry is of type AutoThrottle.)
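A small helper to do that lookup without hard-coding the index might look like this (a sketch; find_autothrottle is just a name I picked):

from scrapy.extensions.throttle import AutoThrottle

def find_autothrottle(crawler):
    # crawler.extensions.middlewares holds the enabled extension objects
    for ext in crawler.extensions.middlewares:
        if isinstance(ext, AutoThrottle):
            return ext
    return None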

Throttling

Now you can use this object and adjust the throttling values dynamically in your spider.
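For example, in a callback you could back off the current download slot when you detect a bad 200 response. A sketch, assuming the helper above and Scrapy's internal layout (the live delay sits on the downloader slot, while the AutoThrottle object carries mindelay/maxdelay; these are internals and may change between versions):

from scrapy import Request

def parse(self, response):
    if blank_data or should_rescrape:  # your own detection logic
        throttle = find_autothrottle(self.crawler)
        # The per-domain delay lives on the downloader slot
        slot_key = response.meta.get("download_slot")
        slot = self.crawler.engine.downloader.slots.get(slot_key)
        if throttle and slot:
            # Back off as if the request had failed
            slot.delay = min(max(slot.delay * 2, throttle.mindelay),
                             throttle.maxdelay)
        yield Request(response.url, dont_filter=True, callback=self.parse)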

Tarun Lalwani
  • Yes thank you. But this doesn't help me with caching or throttling. Usually 'bad' responses are returned pretty quickly, so this way I will just queue lots of requests and overload the server. – C. Yduqoli Jun 28 '18 at 09:14
  • What is your queue length when you requeue the url? So you want these urls to be queued for at least X amount of time? – Tarun Lalwani Jun 28 '18 at 09:37
  • My queue is not very large (thousands), but all requests are to the same server, so any request might exhibit this behavior. I want autothrottle to go into effect (which uses the download latency to adjust the request delay). However, bad responses don't have a large latency; they are returned from the server pretty much instantaneously. So I want to tell autothrottle that I just received a bad response and it should start slowing down requests until it hits a rate which doesn't yield bad responses anymore. – C. Yduqoli Jun 28 '18 at 10:00