I've read some articles about designing distributed web crawlers, but I still have some questions about the architecture, especially about how to distribute the URL frontier and the crawling workers. Currently there are two options in my mind:
- The URL frontier and the crawling workers are separate microservices, and the frontier sends URLs to workers based on some hash or routing rules, ideally so that URLs resolving to the same IP always go to the same worker (to control per-host QPS and avoid duplicate TCP connections). A rough sketch of the routing rule I have in mind is shown after this list.
But in this design the crawling workers don't seem stateless: each worker maintains its own queue of unvisited URLs, so how does a worker recover from a failure? I also feel the workers should be stateless, which would be better for scalability.
- Use a distributed message queue (e.g. Kafka) as the URL frontier, and have the crawling workers fetch URLs from the queue; I guess a pull-based queue would be suitable for this case.
But with this solution I'm not sure how to guarantee that URLs of the same IP go to the same worker; message queues such as Kafka don't seem to have that kind of custom routing strategy? The second sketch below shows roughly what I would like to express.
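To make option 1 concrete, here is a minimal sketch of the routing rule I mean, hashing the URL's host to a worker index. The worker count, the choice of SHA-1, and the use of the hostname as a stand-in for the IP are all just assumptions for illustration:

```python
import hashlib
from urllib.parse import urlparse

NUM_WORKERS = 8  # placeholder; in practice this would come from service discovery

def worker_for(url: str) -> int:
    """Route a URL to a worker index by hashing its host, so that all URLs
    for the same host (and usually the same IP) land on the same worker."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_WORKERS

# Both URLs map to the same worker, so the QPS limit and the keep-alive
# connection for example.com stay in one place.
print(worker_for("https://example.com/a"), worker_for("https://example.com/b"))
```

One problem I already see with this naive modulo scheme is that changing NUM_WORKERS reshuffles almost every host, which is part of why I'm unsure how recovery and scaling should work here.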
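For option 2, this is roughly the behaviour I would like to express: publish each URL keyed by its host, hoping that messages with the same key always end up with the same consumer. A sketch using the kafka-python client; the topic name, broker address, and the assumption that keying gives me this guarantee are all mine, and the last point is exactly what I'm unsure about:

```python
from urllib.parse import urlparse
from kafka import KafkaProducer  # assuming the kafka-python client

producer = KafkaProducer(bootstrap_servers="localhost:9092")  # placeholder broker

def enqueue(url: str) -> None:
    """Publish a URL keyed by its host, in the hope that identical keys are
    routed to the same partition and therefore the same crawling worker."""
    host = urlparse(url).netloc.lower()
    producer.send("frontier-urls",              # hypothetical topic name
                  key=host.encode("utf-8"),
                  value=url.encode("utf-8"))

enqueue("https://example.com/a")
enqueue("https://example.com/b")
producer.flush()
```

Is keying like this enough to get the "same IP, same worker" property, or does it need something more, e.g. a custom partitioner or a different queueing system?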