
I'm interested in periodically scraping a particular website that has ~100 million items on it. The scraper can download and process an item very quickly, on the order of 50 ms, but even at that speed a single scraper would take nearly two months (100 million × 50 ms ≈ 58 days) to complete a pass.

The obvious solution is to use multiple scrapers. However, at some point the underlying web service will become saturated and start to slow down. I want to be respectful of the service and not DDoS it, while still scraping as efficiently as possible.

This is clearly an optimization problem, but I'm not sure how to approach modeling it. Ideally I'd like to determine how many scrapers to run and what request delay each of them should target. Any ideas?
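One practical way to sidestep picking a fixed number of scrapers and a fixed delay up front is to adapt the delay to the latency you observe. Below is a minimal sketch of an AIMD-style (additive-increase/multiplicative-decrease) throttle, assuming a hypothetical `fetch_item(url)` function for your existing 50 ms download-and-process step; the baseline, thresholds, and bounds are illustrative, not a definitive design.

```python
import time

BASELINE_LATENCY = 0.05   # seconds; the ~50 ms you observe when the service is healthy
MIN_DELAY = 0.0           # lower bound on the per-request delay
MAX_DELAY = 5.0           # upper bound so a worker never stalls forever


def scrape_worker(urls, fetch_item, delay=0.1):
    """Fetch each URL, adjusting the inter-request delay based on observed latency."""
    for url in urls:
        start = time.monotonic()
        fetch_item(url)                        # download + process one item
        latency = time.monotonic() - start

        if latency > 2 * BASELINE_LATENCY:
            # The service looks loaded: back off multiplicatively.
            delay = min(MAX_DELAY, max(delay, 0.05) * 2)
        else:
            # The service looks healthy: speed up additively.
            delay = max(MIN_DELAY, delay - 0.01)

        time.sleep(delay)
```

This is the same feedback idea TCP congestion control uses: you could run N such workers and treat persistent high latency even at the maximum delay as a signal to reduce N, rather than trying to solve for the optimal N and delay analytically.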

Allyl Isocyanate

1 Answer


You may try out the URL frontier approach for crawling.
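For illustration, here is a minimal sketch of the URL-frontier idea (per-host queues plus a politeness delay between requests to the same host). The class and method names are made up for this example and are not Frontera's API.

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse


class Frontier:
    def __init__(self, politeness_delay=1.0):
        self.delay = politeness_delay      # minimum seconds between hits to one host
        self.queues = defaultdict(deque)   # host -> pending URLs
        self.ready = []                    # heap of (next_allowed_time, host)

    def add(self, url):
        host = urlparse(url).netloc
        if not self.queues[host]:
            heapq.heappush(self.ready, (time.monotonic(), host))
        self.queues[host].append(url)

    def next_url(self):
        """Return the next URL whose host may be fetched now, or None if nothing is due."""
        while self.ready:
            next_time, host = self.ready[0]
            if next_time > time.monotonic():
                return None                # earliest host is not due yet
            heapq.heappop(self.ready)
            queue = self.queues[host]
            if not queue:
                continue                   # stale entry for a drained host
            url = queue.popleft()
            if queue:
                # Reschedule the host after the politeness delay.
                heapq.heappush(self.ready, (time.monotonic() + self.delay, host))
            return url
        return None
```

For a single target site this degenerates to one queue with a fixed delay between requests, which is essentially the politeness throttle you are after.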

There is a Python library called Frontera that implements the same approach.

Disclaimer: I am not endorsing or advertising Frontera, nor am I affiliated with it in any way.

Harsh Gupta