
The case is that we have multiple servers (40+) scraping one and the same URL at the same time (to keep the lag as small as possible) and saving the data into the database (MySQL).

The problem now is that the data switches back and forth. For example, the content would flip A <-> B <-> A <-> B <-> A within a few seconds due to crawler/database lag.

Is there a good way to prevent it? We're writing the crawler in Perl, but any language would be fine for us since we can borrow the idea behind it.

Any tip would be really appreciated. Redis? ZeroMQ?

Thanks

Fayland Lam
  • Does each of these application servers share the same database, or do they run their own? If the storage is centralised, your consistency problem is with the crawler processes taking different times to complete, so it is possible that process A starts, then B starts and finishes, then A finishes, and you get out of date data. If the data has a timestamp you can use that to store it to keep the newest version, or you can attach a timestamp (in milliseconds) straight after the fetch has finished and use that. It doesn't matter what technology or storage you use. – simbabque Jun 04 '19 at 14:29
  • It shares the same database. Yes, the case is that crawler X may get out-of-date data, and the data does not have a timestamp. We only have the time() of the server when it finishes. So what's the best way to store that millisecond-level timestamp in MySQL? Right now we're using INSERT INTO .. ON DUPLICATE KEY UPDATE ... Thanks – Fayland Lam Jun 04 '19 at 14:33
  • You are overwriting the data, not storing each record separately? I would use an integer field with microseconds. Use Time::HiRes to get the `microtime` and on your LWP, install [a handler](https://metacpan.org/pod/LWP::UserAgent#HANDLERS) for either `response_header` or `response_done` and stick the microtime in there. Then use that for detecting which one is the newest. – simbabque Jun 04 '19 at 15:04
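The timestamp approach from the comments above can be sketched as follows. This is a minimal sketch, not the poster's actual code: the `pages(url, content, fetched_at)` table and all helper names are hypothetical, `fetched_at` is assumed to be a `BIGINT` holding microseconds, and the caller is assumed to construct the `LWP::UserAgent` and `DBI` handles.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday);

# Microseconds since the epoch as one integer, suitable for a BIGINT column.
sub microtime {
    my ($sec, $usec) = gettimeofday();
    return $sec * 1_000_000 + $usec;
}

# Fetch a URL and timestamp it the moment the response headers arrive,
# so the timestamp reflects the fetch itself, not later processing.
# $ua is an LWP::UserAgent instance created by the caller.
sub fetch_with_timestamp {
    my ($ua, $url) = @_;
    my $fetched_at;
    $ua->add_handler(
        response_header => sub { $fetched_at = microtime() },
        owner           => 'fetch_with_timestamp',
    );
    my $response = $ua->get($url);
    $ua->remove_handler('response_header', owner => 'fetch_with_timestamp');
    return ($response, $fetched_at);
}

# Upsert that only overwrites when this fetch is newer than the stored row,
# so a slow crawler finishing late cannot clobber fresher data.
# $dbh is a DBI handle; "pages" and its columns are hypothetical.
sub save_if_newer {
    my ($dbh, $url, $content, $fetched_at) = @_;
    $dbh->do(q{
        INSERT INTO pages (url, content, fetched_at)
        VALUES (?, ?, ?)
        ON DUPLICATE KEY UPDATE
            content    = IF(VALUES(fetched_at) > fetched_at, VALUES(content), content),
            fetched_at = IF(VALUES(fetched_at) > fetched_at, VALUES(fetched_at), fetched_at)
    }, undef, $url, $content, $fetched_at);
}
```

The key point is that the freshness check happens inside the single `INSERT ... ON DUPLICATE KEY UPDATE` statement, so no separate locking is needed for this variant: MySQL evaluates the `IF()` against the stored `fetched_at` atomically within the statement.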

1 Answer


Lock a row so another process cannot update it.
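Since the linked documentation isn't quoted here, a sketch of what row locking could look like: in MySQL/InnoDB a row lock is typically taken with `SELECT ... FOR UPDATE` inside a transaction. The sketch below assumes the same hypothetical `pages(url, content, fetched_at)` table as discussed in the question's comments; `should_overwrite` and `update_page_locked` are made-up names, and `$dbh` is a DBI handle connected to MySQL.

```perl
use strict;
use warnings;

# Decide whether an incoming fetch should replace the stored row:
# overwrite when there is no stored row yet, or when this fetch is newer.
sub should_overwrite {
    my ($stored_at, $fetched_at) = @_;
    return !defined $stored_at || $fetched_at > $stored_at;
}

# Hold a row lock for the whole check-and-write, so no other crawler
# process can slip an older result in between the read and the write.
sub update_page_locked {
    my ($dbh, $url, $content, $fetched_at) = @_;
    $dbh->begin_work;
    # FOR UPDATE takes an exclusive row lock until commit/rollback.
    my ($stored_at) = $dbh->selectrow_array(
        'SELECT fetched_at FROM pages WHERE url = ? FOR UPDATE',
        undef, $url,
    );
    if (should_overwrite($stored_at, $fetched_at)) {
        $dbh->do(q{
            INSERT INTO pages (url, content, fetched_at) VALUES (?, ?, ?)
            ON DUPLICATE KEY UPDATE
                content = VALUES(content), fetched_at = VALUES(fetched_at)
        }, undef, $url, $content, $fetched_at);
    }
    $dbh->commit;
}
```

The trade-off versus a purely statement-level approach is that the lock serializes all 40+ writers on that row, which is exactly what prevents the A <-> B flipping, at the cost of writers waiting on each other.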

daxim
  • Could you please quote or summarise the relevant part of the doc here? As it stands this is a link-only answer. I would do it for you, but I can't read the whole thing now unfortunately. – simbabque Jun 05 '19 at 10:21