
I'm trying to use Scrapinghub to crawl a website that heavily rate-limits requests.

If I run the spider as-is, I start getting 429 responses pretty quickly.

If I enable Crawlera as per the standard instructions, the spider doesn't work anymore.

If I set `headers = {"X-Crawlera-Cookies": "disable"}`, the spider works again, but I get 429s -- so I assume the rate limiter works (at least partly) on the cookie.
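
For reference, this is roughly how I'm applying that header at the moment (the spider name and URL are placeholders, not my real ones):

```python
import scrapy


class LimitedSiteSpider(scrapy.Spider):
    name = "limited_site"                       # placeholder name
    start_urls = ["https://example.com/items"]  # placeholder URL

    # Tell Crawlera not to manage cookies itself, so Scrapy's own
    # cookie middleware keeps the site session between requests.
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {"X-Crawlera-Cookies": "disable"},
    }

    def parse(self, response):
        self.logger.info("got %s with status %s", response.url, response.status)
```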

So what would be a good approach here?

kenshin
  • There are many strategies to increase your effective rate. You can raise `CONCURRENT_REQUESTS`, but not too much, because the website will throttle you with 500 responses (see the settings sketch after these comments). You can also duplicate your spider and split the set of URLs you want to crawl, dedicating a slice to each copy. It's a large subject. Besides, if it works well without cookies, just do that. – AvyWam Sep 09 '19 at 19:16
  • @AvyWam No, without cookies it does not work: they need to be handled by Scrapy to maintain them between requests. And that's my point: if they need to be kept, the Crawlera middleware seems useless to me, unless it also exposed sessions -- but [it doesn't](https://github.com/scrapy-plugins/scrapy-crawlera/issues/27). – kenshin Sep 10 '19 at 07:34
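
A minimal sketch of the kind of throttling settings AvyWam refers to (the values are illustrative, not tuned for any particular site, and note that the Crawlera middleware may override delay settings when it is enabled):

```python
# settings.py -- illustrative values only; tune against the target site
CONCURRENT_REQUESTS = 4               # keep concurrency low to stay under the rate limit
DOWNLOAD_DELAY = 2                    # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True           # let Scrapy back off when responses slow down
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
RETRY_HTTP_CODES = [429, 500, 503]    # retry when the site throttles or errors out
```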

1 Answer


You can try rotating the User-Agent with a random user-agent middleware. If you don't want to write your own implementation, you can try this one:

https://github.com/cnu/scrapy-random-useragent
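
If you did want to write your own, a minimal random user-agent downloader middleware could look roughly like this (the class name, module path and `USER_AGENT_LIST` setting are made up for illustration; they are not the linked package's API):

```python
import random


class RandomUserAgentMiddleware:
    """Pick a random User-Agent for every outgoing request."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is a hypothetical custom setting holding UA strings.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
```

You would then register it in `DOWNLOADER_MIDDLEWARES` and disable Scrapy's built-in `UserAgentMiddleware`.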

Manualmsdos
  • I didn't know about that, thanks. But why would it help if the limiting is done based on cookies? – kenshin Sep 10 '19 at 07:31
  • The User-Agent is a separate header rather than part of the cookie, but I think you're right, you need something more, maybe a change of Session ID (see the sketch below)? https://en.wikipedia.org/wiki/Session_ID – Manualmsdos Sep 10 '19 at 07:47
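
If the underlying problem is keeping the site's session alive while going through Crawlera, one option the comments point at is driving Crawlera sessions by hand with the `X-Crawlera-Session` header: ask for a new session on the first request, then echo the id Crawlera returns on every later request so it keeps one outgoing IP and cookie jar. A rough sketch, assuming the documented create/reuse behaviour of that header, with placeholder spider name and URLs:

```python
import scrapy


class CrawleraSessionSpider(scrapy.Spider):
    """Illustrative spider that pins all requests to one Crawlera session."""

    name = "crawlera_session_demo"              # placeholder name
    start_urls = ["https://example.com/login"]  # placeholder URL

    def start_requests(self):
        for url in self.start_urls:
            # Ask Crawlera to open a new session for this crawl.
            yield scrapy.Request(
                url, headers={"X-Crawlera-Session": "create"}, callback=self.parse
            )

    def parse(self, response):
        # Crawlera echoes back the id of the session it created.
        session = response.headers.get("X-Crawlera-Session", b"").decode()
        for href in response.css("a::attr(href)").getall()[:5]:
            # Reuse the same session so the site sees a consistent client.
            yield response.follow(
                href, headers={"X-Crawlera-Session": session}, callback=self.parse_page
            )

    def parse_page(self, response):
        self.logger.info("fetched %s with status %s", response.url, response.status)
```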