I'm studying up for system design interviews and have run into this pattern in several different problems. Imagine I have a large volume of work that needs to be repeatedly processed at some cadence. For example, I have a large number of alert configurations that need to be checked every 5 min to see if the alert threshold has been breached.
The general approach is to split the work across a cluster of servers for scalability and fault tolerance. Each server would work as follows:
```
start up
read assigned shard
while true:
    process the assigned shard
    sleep 5 min
```
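The loop above can be sketched in Python. `process_shard` is a hypothetical stand-in for the real alert-checking work, and the `passes`/`interval_sec` parameters exist only so the sketch can terminate; a real worker would loop forever with a 5-minute sleep:

```python
import time

def process_shard(shard_index):
    # Hypothetical stand-in for checking every alert config in the shard.
    return f"processed shard {shard_index}"

def run_loop(shard_index, passes, interval_sec=300):
    """Process the assigned shard `passes` times, sleeping between passes.

    A production worker would loop indefinitely with interval_sec=300 (5 min).
    """
    results = []
    for p in range(passes):
        results.append(process_shard(shard_index))
        if p < passes - 1:  # no sleep after the final pass
            time.sleep(interval_sec)
    return results
```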
Based on this answer (ZooKeeper for assigning shard indexes), I came up with the following approach using ZooKeeper:
- When a server starts up, it adds itself as a sequential child node under `/service` (i.e. `/service/{server-id}`) and watches the children of that node. ZooKeeper assigns the server a unique sequence number.
- The server reads its unique sequence number `i` from ZooKeeper. It also reads the total number of children `n` under the `/service` node.
- The server identifies its shard by dividing the total volume of work into `n` pieces and locating the `i`th piece.
- While true:
  - If the watch has triggered (because servers have been added to or removed from the cluster), the server recalculates its shard.
  - The server processes its shard.
  - Sleep 5 min.
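The shard-identification step (splitting the total work into `n` pieces and taking the `i`th) can be written as a small pure function. `shard_range` below is a hypothetical helper, assuming the work is an indexable collection of `total_items` alert configurations:

```python
def shard_range(i, n, total_items):
    """Return the half-open [start, end) slice of items owned by server i
    when total_items are split across n servers as evenly as possible."""
    if n <= 0 or not (0 <= i < n):
        raise ValueError("need 0 <= i < n")
    base, extra = divmod(total_items, n)
    # The first `extra` servers each take one extra item.
    start = i * base + min(i, extra)
    end = start + base + (1 if i < extra else 0)
    return start, end
```

Every server can run this independently given only its own `i` and the current `n`, which is what makes the watch-and-recalculate step cheap.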
Does this sound reasonable? Is this generally how it's done in real-world systems? A few questions:
- In step #2, when the server reads the number of children, does it need to wait a period of time to let things settle down? What if every server is joining at the same time?
- I'm not sure how timely the watch would be. It seems like there would be a window where this server is still processing its old shard while a reassignment causes another server to pick up a shard that overlaps with what this server is processing, resulting in duplicate processing (which may or may not be OK). Is there any way to solve this?
Thanks!