
I am running a niche search product that works with a web crawler. The current crawler is a single (PHP Laravel) worker crawling the URLs and putting the results into an Elasticsearch index. The system continuously re-crawls the found URLs at an interval of X milliseconds.

This has served me well, but with some new large clients coming up the crawler is going to hit its limits. I need to redesign the system as a distributed crawler to speed up the crawling. The problem is the combination of requirements below.

The system must adhere to the following 2 rules:

  1. multiple workers (concurrency issues)
  2. variable rate limit per client. I need to be very sure the system doesn't crawl a given client more than once every X milliseconds.

What I have tried:

  • I tried putting the URLs in a MySQL table and letting the workers query for a URL to crawl based on last_crawled_at timestamps in the clients and urls tables. But MySQL doesn't like multiple concurrent workers and I receive all sorts of deadlocks (see the SQL sketch after this list).

  • I tried putting the URLs into Redis. I got this kind of working, but only with a Lua script that checks and sets an expiring key for every client that is being served (the Lua sketch after this list shows the shape of it). This all feels way too hackish.

  • I thought about filling a regular queue, but this would violate rule number 2, as I can't be 100% sure the workers will process the queue in real time.
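
For context, this is roughly the kind of claim query I have in mind for the MySQL route. It is only a minimal sketch: it assumes MySQL 8.0+ (for SKIP LOCKED) and hypothetical next_allowed_at, rate_limit_ms and recrawl_interval_ms columns rather than my actual schema.

    -- Claim one due URL; SKIP LOCKED makes concurrent workers pass over rows
    -- another worker has already locked instead of deadlocking on them.
    START TRANSACTION;

    SELECT u.id, u.url, u.client_id
    FROM urls u
    JOIN clients c ON c.id = u.client_id
    WHERE c.next_allowed_at <= NOW(3)   -- client's rate-limit window has passed
      AND u.last_crawled_at <= NOW(3) - INTERVAL c.recrawl_interval_ms * 1000 MICROSECOND
    ORDER BY u.last_crawled_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED;

    -- Push the client's window forward and stamp the URL before committing,
    -- so no other worker can pick work for this client inside the interval.
    UPDATE clients
    SET next_allowed_at = NOW(3) + INTERVAL rate_limit_ms * 1000 MICROSECOND
    WHERE id = ?;

    UPDATE urls SET last_crawled_at = NOW(3) WHERE id = ?;

    COMMIT;   -- crawl the selected URL only after the claim is committed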
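
And for completeness, the Redis Lua script is roughly this shape (a sketch with made-up key names ratelimit:<client> and queue:<client>; my real keys differ):

    -- KEYS[1] = ratelimit:<client>  (expiring key marking the rate-limit window)
    -- KEYS[2] = queue:<client>      (list of URLs waiting for this client)
    -- ARGV[1] = rate-limit interval in milliseconds
    if redis.call('EXISTS', KEYS[1]) == 1 then
      return nil   -- client is still inside its window, hand out nothing
    end
    local url = redis.call('LPOP', KEYS[2])
    if url then
      redis.call('SET', KEYS[1], '1', 'PX', ARGV[1])
    end
    return url

Run via EVAL/EVALSHA the whole script executes atomically, so two workers can never both pass the rate-limit check for the same client.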

Can anybody explain to me how the big boys do this? How can multiple processes query a big/massive list of URLs based on a few criteria (like rate-limiting per client) and make sure each URL is handed out to only one worker?

Ideally we wouldn't need another database besides Elasticsearch, which already holds all the available/found URLs, but I don't think that's possible?

PinkFloyd
  • Are you still open for discussion? There's no single fixed answer to this because there are so many styles of distributed crawler; I've spent years reading about them as a hobby and am now building one for production. Let me know if you are open to discussing distributed crawlers, drop your Telegram here – CodeGuru Jan 29 '20 at 11:12

1 Answer


Have a look at StormCrawler: it is a distributed web crawler which has an Elasticsearch module. It is highly customisable and enforces politeness by respecting robots.txt and, by default, using a single fetching thread per host or domain.

Julien Nioche