
I'm interested in periodically scraping a particular website that has ~100 million items on it. The scraper can download and process an item very quickly, on the order of 50 ms, but even at that speed a single scraper would take nearly two months (100 million × 50 ms ≈ 58 days) to complete a pass.

The obvious solution is to use multiple scrapers. However, at some point the underlying web service will become saturated and start to slow down. I want to be respectful of the service and not DDoS it, while still scraping as efficiently as possible.

This is clearly an optimization problem, but I'm not sure how to approach modeling it. Ideally I'd like to determine how many scrapers to run and what request delay each of them should target. Any ideas?
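One practical way to sidestep picking a fixed number of scrapers and a fixed delay up front is to adapt the delay to the latency you observe. Below is a minimal sketch of an AIMD-style (additive-increase/multiplicative-decrease) throttle, assuming a hypothetical `fetch_item(url)` function for your existing 50 ms download-and-process step; the baseline, thresholds, and bounds are illustrative, not a definitive design.

```python
import time

BASELINE_LATENCY = 0.05   # seconds; the ~50 ms you observe when the service is healthy
MIN_DELAY = 0.0           # lower bound on the per-request delay
MAX_DELAY = 5.0           # upper bound so a worker never stalls forever


def scrape_worker(urls, fetch_item, delay=0.1):
    """Fetch each URL, adjusting the inter-request delay based on observed latency."""
    for url in urls:
        start = time.monotonic()
        fetch_item(url)                        # download + process one item
        latency = time.monotonic() - start

        if latency > 2 * BASELINE_LATENCY:
            # The service looks loaded: back off multiplicatively.
            delay = min(MAX_DELAY, max(delay, 0.05) * 2)
        else:
            # The service looks healthy: speed up additively.
            delay = max(MIN_DELAY, delay - 0.01)

        time.sleep(delay)
```

This is the same feedback idea TCP congestion control uses: you could run N such workers and treat persistent high latency even at the maximum delay as a signal to reduce N, rather than trying to solve for the optimal N and delay analytically.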

Allyl Isocyanate

1 Answer


You may try out the URL frontier approach for crawling.
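For illustration, here is a minimal sketch of the URL-frontier idea (per-host queues plus a politeness delay between requests to the same host). The class and method names are made up for this example and are not Frontera's API.

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse


class Frontier:
    def __init__(self, politeness_delay=1.0):
        self.delay = politeness_delay      # minimum seconds between hits to one host
        self.queues = defaultdict(deque)   # host -> pending URLs
        self.ready = []                    # heap of (next_allowed_time, host)

    def add(self, url):
        host = urlparse(url).netloc
        if not self.queues[host]:
            heapq.heappush(self.ready, (time.monotonic(), host))
        self.queues[host].append(url)

    def next_url(self):
        """Return the next URL whose host may be fetched now, or None if nothing is due."""
        while self.ready:
            next_time, host = self.ready[0]
            if next_time > time.monotonic():
                return None                # earliest host is not due yet
            heapq.heappop(self.ready)
            queue = self.queues[host]
            if not queue:
                continue                   # stale entry for a drained host
            url = queue.popleft()
            if queue:
                # Reschedule the host after the politeness delay.
                heapq.heappush(self.ready, (time.monotonic() + self.delay, host))
            return url
        return None
```

For a single target site this degenerates to one queue with a fixed delay between requests, which is essentially the politeness throttle you are after.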

There is a Python library called Frontera that implements the same approach.

Disclaimer: I am not endorsing or advertising Frontera, nor am I affiliated with it in any way.

Harsh Gupta