I'm trying to run a scraper whose output log ends as follows:
2017-04-25 20:22:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 http://www.apkmirror.com/apk/instagram/instagram-instagram/instagram-instagram-9-0-0-34920-release/instagram-9-0-0-4-android-apk-download/>: HTTP status code is not handled or not allowed
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-25 20:22:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16048410,
'downloader/request_count': 32902,
'downloader/request_method_count/GET': 32902,
'downloader/response_bytes': 117633316,
'downloader/response_count': 32902,
'downloader/response_status_count/200': 121,
'downloader/response_status_count/429': 32781,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 4, 25, 18, 22, 22, 710446),
'log_count/DEBUG': 32903,
'log_count/INFO': 32815,
'request_depth_max': 2,
'response_received_count': 32902,
'scheduler/dequeued': 32902,
'scheduler/dequeued/memory': 32902,
'scheduler/enqueued': 32902,
'scheduler/enqueued/memory': 32902,
'start_time': datetime.datetime(2017, 4, 25, 17, 54, 36, 621481)}
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Spider closed (finished)
In short, of the 32,902 requests, only 121 succeeded (response code 200), whereas the remaining 32,781 received a 429 'Too Many Requests' response (cf. https://httpstatuses.com/429).
Are there any recommended ways to get around this? To start with, I'd like to look at the details of the 429 response rather than just ignoring it, as it may contain a Retry-After header indicating how long to wait before making a new request.
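To make the Retry-After idea concrete, here is a minimal sketch of turning that header into a wait time. In Scrapy, a spider that sets handle_httpstatus_list = [429] receives the 429 response in its callback instead of having it dropped, and response.headers.get('Retry-After') then returns the raw header as bytes (or None). The helper name parse_retry_after is hypothetical and uses only the standard library; whether this particular site sends the header at all is not something I know.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return the number of seconds to wait, or None if the header is
    missing or cannot be parsed.

    Retry-After is either a non-negative integer number of seconds or an
    HTTP-date. `parse_retry_after` is a hypothetical helper name.
    """
    if value is None:
        return None
    if isinstance(value, bytes):
        # Scrapy exposes header values as bytes.
        value = value.decode("latin-1")
    value = value.strip()

    # Form 1: delay in seconds, e.g. "120".
    if value.isdigit():
        return float(value)

    # Form 2: HTTP-date, e.g. "Tue, 25 Apr 2017 18:23:22 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)

    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means no further waiting is needed.
    return max(0.0, (when - now).total_seconds())
```

In a spider callback this would be called as parse_retry_after(response.headers.get('Retry-After')) whenever response.status == 429, and the result could be logged or used to decide how long to back off.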
Also, if the requests are made through Privoxy and Tor as described in http://blog.michaelyin.info/2014/02/19/scrapy-socket-proxy/, it may be possible to implement a retry middleware that makes Tor change its IP address when this occurs. Are there any public examples of such code?
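For the Tor part, the piece such a middleware would need is a way to ask Tor for a new circuit. The following is only a sketch, assuming Tor's control port is enabled on 127.0.0.1:9051 with password authentication; the host, port, password and function names are all assumptions for illustration. It sends the control-protocol commands AUTHENTICATE and SIGNAL NEWNYM over a plain socket. Note that NEWNYM asks for fresh circuits but does not guarantee a different exit IP, and Tor rate-limits the signal.

```python
import socket


def build_newnym_commands(password=""):
    """Return the control-protocol lines that authenticate and then
    request new circuits. Hypothetical helper name."""
    # Escape backslashes and double quotes inside the quoted password.
    escaped = password.replace("\\", "\\\\").replace('"', '\\"')
    return ['AUTHENTICATE "%s"\r\n' % escaped, "SIGNAL NEWNYM\r\n"]


def renew_tor_identity(host="127.0.0.1", port=9051, password="", timeout=10.0):
    """Send the commands to the Tor control port.

    Returns True if every command was answered with a 250 status,
    False otherwise. Host, port and password are assumed values.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        for command in build_newnym_commands(password):
            conn.sendall(command.encode("ascii"))
            reply = conn.recv(1024).decode("ascii", "replace")
            if not reply.startswith("250"):
                return False
    return True
```

A subclass of Scrapy's RetryMiddleware could call renew_tor_identity() from process_response when response.status == 429 before rescheduling the request, though I have not verified how well that works against this site.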