I've read some articles about designing distributed web crawlers, but I still have some questions about the architecture, especially about how to distribute the URL frontier and the crawling workers. Currently there are two options in my mind:
- The URL frontier and the crawling workers are separate microservices, and the frontier sends URLs to workers based on some hash or routing rules, ideally so that URLs resolving to the same IP always go to the same worker (to control per-host QPS and avoid duplicate TCP connections). A rough sketch of the routing rule I have in mind is shown after this list.
But in this design the crawling workers don't seem stateless: each worker maintains its own queue of unvisited URLs, so how does a worker recover from a failure? I also feel the workers should be stateless, which would be better for scalability.
- Use a distributed message queue (e.g. Kafka) as the URL frontier, and have the crawling workers fetch URLs from the queue; I guess a pull-based queue would be suitable for this case.
But with this solution I'm not sure how to guarantee that URLs of the same IP go to the same worker; message queues such as Kafka don't seem to have that kind of custom routing strategy? The second sketch below shows roughly what I would like to express.
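To make option 1 concrete, here is a minimal sketch of the routing rule I mean, hashing the URL's host to a worker index. The worker count, the choice of SHA-1, and the use of the hostname as a stand-in for the IP are all just assumptions for illustration:

```python
import hashlib
from urllib.parse import urlparse

NUM_WORKERS = 8  # placeholder; in practice this would come from service discovery

def worker_for(url: str) -> int:
    """Route a URL to a worker index by hashing its host, so that all URLs
    for the same host (and usually the same IP) land on the same worker."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_WORKERS

# Both URLs map to the same worker, so the QPS limit and the keep-alive
# connection for example.com stay in one place.
print(worker_for("https://example.com/a"), worker_for("https://example.com/b"))
```

One problem I already see with this naive modulo scheme is that changing NUM_WORKERS reshuffles almost every host, which is part of why I'm unsure how recovery and scaling should work here.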
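For option 2, this is roughly the behaviour I would like to express: publish each URL keyed by its host, hoping that messages with the same key always end up with the same consumer. A sketch using the kafka-python client; the topic name, broker address, and the assumption that keying gives me this guarantee are all mine, and the last point is exactly what I'm unsure about:

```python
from urllib.parse import urlparse
from kafka import KafkaProducer  # assuming the kafka-python client

producer = KafkaProducer(bootstrap_servers="localhost:9092")  # placeholder broker

def enqueue(url: str) -> None:
    """Publish a URL keyed by its host, in the hope that identical keys are
    routed to the same partition and therefore the same crawling worker."""
    host = urlparse(url).netloc.lower()
    producer.send("frontier-urls",              # hypothetical topic name
                  key=host.encode("utf-8"),
                  value=url.encode("utf-8"))

enqueue("https://example.com/a")
enqueue("https://example.com/b")
producer.flush()
```

Is keying like this enough to get the "same IP, same worker" property, or does it need something more, e.g. a custom partitioner or a different queueing system?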