Scalable algorithm to detect stale data

Question

Here is the problem:

"Agents" installed on many different servers send "heartbeat" signals up to a central server every 5 seconds. How can I find the ones that have missed their heartbeat for more than 10 seconds actively and raise an alert?

The problem is simple if you don't think about scalablity. In the simplest form, you can record the timestamp of the latest heartbeat received from each agent in a database table and run a regular query to find the ones older than the threshold.

This solution is however not scalable to millions of agents.

I am looking for algorithms or techologies that make this possible.

What do you mean by "scalable"? Would running a separate thread to monitor each agent and to actively flag if there were no heartbeat for ten seconds be "scalable"? — rossum, Aug 16 '11 at 11:19
Perhaps I should have written "central system" instead of "a central server". I did not mean to have one single server. I meant a central system as opposed to a distributed system. Central systems are certainly scalable. — Khash, Aug 20 '11 at 08:38

Lior Kogan · Accepted Answer · 2011-08-16T20:45:40.397

2

Use a map: AgentId --> LastHearbeatTime
Use 11 sets (assuming a resolution of 1 second is enough), each holds Ids of Agents reported in a 1-second window.

Every time an agent report a hearthbeat: 1. Find it in the map 2. Delete it from the relevant set 3. Update it in the map 4. Add it to the relevant set

Define a thread: Once per second, the oldest set expires. It should be empty. If it doesn't - it contains Ids of agents which did not report. Once a set expires, you can reuse it (cyclic array of sets).

I believe it can be implemented without locks (maybe you'll need 12 sets).

edited Aug 16 '11 at 20:45

answered Aug 16 '11 at 20:40

Lior Kogan

19,919
6
53
85

Thats a solid solution recipe dude. – scope_creep Jan 02 '12 at 18:39
Can you give some examples of further reading on how to implement this? – Jonny Flowers Nov 18 '14 at 14:20

score 1 · Answer 2 · answered Aug 16 '11 at 21:53

Without knowing language and platform it's somewhat hard to advise you on a detailed implementation, however my advice is somewhat similar to Lior Kogan's. In my opinion, however, you only need two sets and no map is involved:

Say you have two variables representing sets, A and B.

Every heartbeat removes the agent id from set A. Every 5 seconds, a different thread raises an alert for every agent id in B, then sets B = A, and last but not least creates a set with all of the agent ids and sets A to equal that (if the number of agent ids is really large, you can prepare the new set between one check and the other and only sleep for the remaining time). Locking would only be needed while changing the variables pointing to each set, provided you use a lock-free set collection. Performance will largely depend on the algorithmic complexity of said implementation, and if you go down this way, you should privilege the one with best performance (not necessarily best big-O, for instance if wost-case latency matters to you) for removals.

As a side note, if memory is not an issue or failures are relatively few, when you check whether you need to raise alerts and do so, you can do that on a thread of its own and getting possibly interesting performance speedups (again, the platform and runtime matter, for in erlang that would be a breeze but in Windows the cost of creating a full-blown new thread might exceed the performance benefit if the failures are few) at the cost of keeping the old B set in memory.

I think this is a good improvement on Linor's solution. Since we are not too sensitive about the detection resolution (10 - 15 seconds is enough), we are going to use a varian't of your and Lior's solution with three sets (1. All agents, 2. Heartbeat received, 3. Pending heartbeat). With a cycling every 5 seconds, we can detect stage agents in between 10 to 15 seconds. — Khash, Aug 20 '11 at 09:06

score -1 · Answer 3 · answered Aug 16 '11 at 20:17

MongoDB is great for this type of use. While not exactly an algorithm, it does fit the bill for a fundamental technology that is needed to create this service. We use it here at CopperEgg for our RevealCloud product to do exactly what you say - we send an alert when the system has gone away for some bit of time - sampling every 5 seconds. I'd love to hear more about your thoughts and use case. Can you provide more details?

Scalable algorithm to detect stale data

3 Answers3

Linked