Distributed crawling and rate limiting / flow control

Question

I run a niche search product that is backed by a web crawler. The current crawler is a single worker (PHP, Laravel) that crawls the URLs and puts the results into Elasticsearch. The system continuously re-crawls the URLs it has found, at an interval of X milliseconds.

This has served me well, but with some large new clients coming up the crawler is going to hit its limits. I need to redesign the system as a distributed crawler to speed up the crawling. The problem is the combination of requirements below.

The system must adhere to the following 2 rules:

1. Multiple workers (concurrency issues).
2. Variable rate limit per client. I need to be very sure the system doesn't crawl a given client more than once every X milliseconds.

What I have tried:

I tried putting the URLs in a MySQL table and letting the workers query for a URL to crawl based on the last_crawled_at timestamps in the clients and urls tables. But MySQL doesn't like multiple concurrent workers, and I get all sorts of deadlocks.
I tried putting the URLs into Redis. I got this kind of working, but only with a Lua script that checks and sets an expiring key for every client that is being served. This all feels way too hackish.
I thought about filling a regular queue, but this would violate rule number 2, as I can't be 100% sure the workers can process the queue in real time.
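
For reference, the expiring-key idea from the Redis attempt can be sketched in-process like this. It is only an illustration under assumptions: `ClientRateGate` and `try_acquire` are hypothetical names, the state lives in one process rather than in a shared store, and the interval is measured from the start of one crawl to the start of the next.

```python
import threading
import time


class ClientRateGate:
    """In-process stand-in for an expiring per-client key (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()
        self._next_allowed = {}  # client id -> earliest time of the next crawl

    def try_acquire(self, client_id, interval_ms):
        """Return True if the caller may crawl client_id now, else False."""
        now = self._clock()
        with self._lock:  # the check and the set must happen as one atomic step
            if now < self._next_allowed.get(client_id, float("-inf")):
                return False  # another worker crawled this client too recently
            self._next_allowed[client_id] = now + interval_ms / 1000.0
            return True
```

A worker would call `try_acquire(client_id, interval_ms)` before fetching and skip the client when it returns False. With a shared store, the same check-and-set can be a single Redis `SET key value NX PX <milliseconds>` command, which is atomic on its own, so a Lua script may not be needed for this part.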

Can anybody explain to me how the big boys do this? How can we have multiple processes query a massive list of URLs based on a few criteria (like rate limiting per client) and make sure each URL is handed out to only one worker?

Ideally we wouldn't need another database besides Elasticsearch holding all the available / found URLs, but I don't think that's possible?

Are you still open for discussion? There's no single fixed answer to this, because there are so many styles of distributed crawler. I've spent years reading about them as a hobby, and now I'm building one for production. Let me know if you are open to discussing distributed crawlers; drop your Telegram here. - CodeGuru, Jan 29 '20 at 11:12

score 0 · Answer 1 · answered Jul 23 '18 at 09:55

Have a look at StormCrawler. It is a distributed web crawler that has an Elasticsearch module. It is highly customisable and enforces politeness by respecting robots.txt and, by default, using a single thread per host or domain.
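
The general idea behind "a single thread per host or domain" can be illustrated by routing all URLs of one client to the same worker, so that the delay is enforced locally without locking across workers. The sketch below is not StormCrawler code; `worker_for` and `partition` are hypothetical names, and it assumes a fixed number of workers and string client ids.

```python
import hashlib


def worker_for(client_id, num_workers):
    """Map a client to one worker index, stable across processes and restarts."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(urls_by_client, num_workers):
    """Split URLs into one bucket per worker; a client never spans two buckets."""
    buckets = [[] for _ in range(num_workers)]
    for client_id, urls in urls_by_client.items():
        bucket = buckets[worker_for(client_id, num_workers)]
        bucket.extend((client_id, url) for url in urls)
    return buckets
```

Because each client belongs to exactly one worker, that worker can keep a simple per-client timestamp and sleep until the next crawl is due. The trade-off is that one very large client cannot be spread over several workers.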

Julien Nioche