
I'm currently using Python requests to download around 20,000 pages of JSON. I'm running into a bottleneck due to rate limiting by the server I'm scraping, and maybe a lack of asynchronous calls/scheduling. I thought Scrapy would be a good solution because I heard it has features to combat these problems, but those are the only parts I need; I don't need spidering/parsing/ORM/etc. Looking at the docs, it was unclear how I would separate out just those components. I need a microservice for just these parts of what Scrapy does: the Flask to Scrapy's Django. I saw grequests might help with async, but if I go that route I still need rate limiting and a way to retry failed requests. Can someone point me in the right direction?

sajattack

1 Answer


If what you need is something to help with rate limiting, I would recommend using a proxy rotation service; Scrapy won't be necessary if you already have your crawler ready.

I would recommend Crawlera or proxymesh.
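
A minimal sketch of this setup with plain requests, assuming you get a list of proxy URLs from whichever service you pick. The endpoints, credentials, and retry settings below are placeholders to adapt:

    import itertools
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    # Hypothetical proxy endpoints -- substitute whatever your service provides.
    PROXIES = itertools.cycle([
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
    ])

    session = requests.Session()
    # Retry transient failures (including 429 rate-limit responses)
    # with exponential backoff.
    retries = Retry(total=5, backoff_factor=1,
                    status_forcelist=[429, 500, 502, 503, 504])
    session.mount("http://", HTTPAdapter(max_retries=retries))
    session.mount("https://", HTTPAdapter(max_retries=retries))

    def fetch(url):
        proxy = next(PROXIES)  # rotate to the next proxy for each request
        response = session.get(url, proxies={"http": proxy, "https": proxy},
                               timeout=30)
        response.raise_for_status()
        return response.json()

Note that a hosted service like Crawlera typically exposes a single endpoint and rotates upstream IPs for you, in which case the cycling step is unnecessary and you just point the session's proxies at that one endpoint.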

eLRuLL
  • That's an interesting approach. Would Tor help with this, or would it have too much overhead/be too slow to speed things up? – sajattack Jun 10 '16 at 02:52
  • Tor would also be another approach; I can't say I've tried it. – eLRuLL Jun 10 '16 at 02:53
  • Tor could help, but you will need other libraries like Stem to control Tor from Python code, and you will have to add some logic for when the Tor identity (proxy) should be changed. – Vikas Ojha Jun 10 '16 at 10:42
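
A minimal sketch of the Stem approach the last comment describes, assuming Tor is running locally with its SOCKS proxy on the default port 9050 and its ControlPort enabled on 9051 (requires pip install stem and requests[socks]; the password is whatever you configured in your torrc):

    import requests
    from stem import Signal
    from stem.control import Controller

    # Route traffic through Tor's local SOCKS proxy (default port 9050).
    TOR_PROXIES = {
        "http": "socks5h://127.0.0.1:9050",
        "https": "socks5h://127.0.0.1:9050",
    }

    def new_identity(password="controlpass"):  # hypothetical torrc password
        # Signal Tor over its control port to build a fresh circuit,
        # which usually gives you a new exit IP.
        with Controller.from_port(port=9051) as controller:
            controller.authenticate(password=password)
            controller.signal(Signal.NEWNYM)

    def fetch(url):
        return requests.get(url, proxies=TOR_PROXIES, timeout=30)

Keep in mind that Tor itself rate-limits NEWNYM (on the order of once every ten seconds), so identity changes are much slower than with a commercial proxy pool.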