I am writing a crawler with Scrapy (Python) and don't know how to handle certain errors.
The website I am crawling sometimes returns an empty body or a normal-looking page that only contains an error message. Both replies come with a standard 200 HTTP status code.
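To make the situation concrete, this is roughly how I detect the bad responses in my callback. It is only a sketch: the URL and the error-message text are placeholders, not the real site.

```python
import scrapy


class MySpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://www.example.com/listing"]  # placeholder URL

    def parse(self, response):
        # Both failure modes arrive with status 200, so I can only
        # recognize them by inspecting the body.
        if not response.body or b"temporary error" in response.body:
            # This is the case I don't know how to handle properly.
            self.logger.warning("Got a bad 200 response for %s", response.url)
            return
        # ... normal item extraction for good responses ...
```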
What I want to do when I encounter such a situation is tell Scrapy to

- not save the response to the cache (I am using HTTPCACHE_ENABLED = True), because the content of a successful request looks different
- reschedule the request
- reduce the request rate (I am using AUTOTHROTTLE_ENABLED = True)
Is there an easy way, like raising a certain exception à la raise scrapy.TemporaryError, or do I have to do everything manually? In the latter case, how do I delete content from the cache or talk to the AutoThrottle extension?
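To illustrate what I mean by "easy way" versus "manually": scrapy.TemporaryError below is purely made up (it does not exist, as far as I know), _looks_broken is a hypothetical helper, and the manual fallback is the kind of thing I would otherwise piece together myself.

```python
def parse(self, response):
    if self._looks_broken(response):  # hypothetical helper that checks the body
        # What I wish existed: one exception that tells Scrapy
        # "temporary failure" -- don't cache, reschedule, slow down.
        # raise scrapy.TemporaryError("empty body / error page")

        # What I can do manually today: re-yield the same request,
        # bypassing the duplicate filter so it gets scheduled again.
        yield response.request.replace(dont_filter=True)
        return
    # ... normal parsing ...
```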
I know I can set dont_cache on requests so they are not cached. But usually I do want to cache my requests, and only decide once I see the response whether I want to keep it. Also, the documentation is not clear on whether this flag only avoids writing the response to the cache or whether it also avoids reading the response from the cache...
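For reference, this is how I understand dont_cache is meant to be used: set per request via meta, i.e. at the point where I do not yet know whether the response will be good or broken (the URL is a placeholder).

```python
import scrapy

# Setting dont_cache when the request is created keeps the HTTP cache
# middleware from storing the response -- but at this point I cannot
# yet tell whether the response will be a good page or an error page.
request = scrapy.Request(
    "https://www.example.com/listing",
    meta={"dont_cache": True},
)
```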
AutoThrottle uses the download latency to adjust the request rate. Its throttling algorithm treats non-200 responses as failed and does not let them decrease the download delay. However, my problematic responses come back with status 200, so AutoThrottle never notices that anything is wrong. There must be a way to tell AutoThrottle to apply its throttling logic and treat these specific responses as failed.
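For completeness, these are the relevant pieces of my settings.py; the delay values are just what I currently have and nothing special.

```python
# settings.py (excerpt)

# Cache every response on disk so repeated runs don't re-download pages.
HTTPCACHE_ENABLED = True

# Let Scrapy adapt the request rate to the measured download latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5      # placeholder value
AUTOTHROTTLE_MAX_DELAY = 60      # placeholder value
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```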