I would have five nodes: `pending_tasks`, `completed_tasks`, `running_tasks`, `workers`, and `queues`. `pending_tasks` is the node that holds tasks, both new ones and those re-triggered after failures in worker nodes. `completed_tasks` holds the details of completed tasks. `running_tasks` holds the tasks currently assigned to workers. In a PoC implementation I did once, I used XML-encoded POJOs to store the tasks' details. Nodes in `pending_tasks`, `completed_tasks`, and `running_tasks` are all persistent.
`workers` holds ephemeral nodes that represent the available workers. Because they are ephemeral, these nodes signal worker failures. `queues` is directly tied to `workers`: there is a node in `queues` for each node in `workers`, and each queue node holds the tasks assigned to that worker.
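To make the layout concrete, here is a minimal bootstrap sketch using the plain ZooKeeper Java client (this is not the original PoC code, which I no longer have; the connection string is an assumption):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class Bootstrap {
    public static void main(String[] args) throws Exception {
        // Connection string is a placeholder; point it at your ensemble.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});

        // All five top-level nodes are persistent; only the children of
        // /workers (created by the workers themselves) are ephemeral.
        String[] roots = {"/pending_tasks", "/completed_tasks",
                          "/running_tasks", "/workers", "/queues"};
        for (String path : roots) {
            try {
                zk.create(path, new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException e) {
                // Already bootstrapped by a previous run; fine.
            }
        }
    }
}
```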
Now, you need a master. The master is responsible for three things: i) watching `pending_tasks` for new tasks; ii) watching `workers` to register new nodes in `queues` when new workers arrive, and to put tasks back in `pending_tasks` when workers go missing; and iii) publishing the results of the tasks in `completed_tasks` (when I did this PoC, the result went through a publish/subscribe notification mechanism). Besides that, the master must perform some clean-up at start-up, given that workers might fail during the master's downtime.
The master algorithm is the following:
```
at (start-up) {
    for (q -> nodesOf(/queues)) {
        if q.name not in nodesOf(/workers) {
            for (t -> nodesOf(/queues/q.name)) {
                create /pending_tasks/t.name
                delete /running_tasks/t.name
                delete /queues/q.name/t.name
            }
            delete /queues/q.name
        }
    }
    for (t -> nodesOf(/completed_tasks)) {
        publish the result
        delete /completed_tasks/t.name
    }
}
```
```
watch (/workers) {
    case c: Created => register the new worker queue
    case d: Deleted => transaction {
        for (t -> nodesOf(/queues/d.name)) {
            create /pending_tasks/t.name
            delete /running_tasks/t.name
            delete /queues/d.name/t.name
        }
        delete /queues/d.name
    }
}
```
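In the Java client, a `transaction` like the one above maps onto `ZooKeeper.multi()`, which applies a batch of operations atomically. A sketch, assuming the task details are stored as the znode payload:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import java.util.ArrayList;
import java.util.List;

class Reassigner {
    // Atomically move every task queued for a dead worker back to
    // /pending_tasks, so a crash mid-reassignment cannot lose a task.
    static void reassignTasks(ZooKeeper zk, String worker) throws Exception {
        List<Op> ops = new ArrayList<>();
        for (String task : zk.getChildren("/queues/" + worker, false)) {
            byte[] details = zk.getData("/queues/" + worker + "/" + task, false, null);
            ops.add(Op.create("/pending_tasks/" + task, details,
                              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
            ops.add(Op.delete("/running_tasks/" + task, -1));
            ops.add(Op.delete("/queues/" + worker + "/" + task, -1));
        }
        ops.add(Op.delete("/queues/" + worker, -1)); // children deleted above
        zk.multi(ops); // all-or-nothing
    }
}
```

The start-up clean-up can reuse the same helper for every queue whose worker is no longer present in `/workers`.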
```
watch (/pending_tasks) {
    case c: Created => transaction {
        create /running_tasks/c.name
        create a persistent node in one of the worker queues (e.g., /queues/worker_0/c.name)
        delete /pending_tasks/c.name
    }
}
```
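Dispatching a pending task is the same `multi()` pattern; if the task is already running, the `create` on `/running_tasks` fails and the whole batch is rejected. The choice of worker (round-robin, least-loaded, etc.) is left to the caller here:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import java.util.Arrays;

class Dispatcher {
    // Atomically claim a pending task and append it to a worker's queue.
    static void dispatch(ZooKeeper zk, String task, String worker) throws Exception {
        byte[] details = zk.getData("/pending_tasks/" + task, false, null);
        zk.multi(Arrays.asList(
            Op.create("/running_tasks/" + task, details,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
            Op.create("/queues/" + worker + "/" + task, details,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
            Op.delete("/pending_tasks/" + task, -1)));
    }
}
```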
```
watch (/completed_tasks) {
    case c: Created =>
        publish the result
        delete /completed_tasks/c.name
}
```
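One caveat when translating these `watch` blocks to the plain Java client: ZooKeeper watches fire only once, so each handler has to re-register itself, and a children watch only tells you that something changed, so you must re-read and diff the child list. A sketch:

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import java.util.List;

class ChildWatcher {
    // Watch a node's children "forever" by re-arming the one-shot watch.
    static void watchChildren(ZooKeeper zk, String path) throws Exception {
        zk.getChildren(path, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                try {
                    List<String> children = zk.getChildren(path, this); // re-arm
                    // Diff against the previous snapshot to tell Created
                    // from Deleted; the event alone does not say which child.
                } catch (Exception e) {
                    // Handle session expiry / connection loss here.
                }
            }
        });
    }
}
```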
The worker algorithm is the following:
```
at (start-up) {
    create /queues/this.name
    create an ephemeral node /workers/this.name
}

watch (/queues/this.name) {
    case c: Created =>
        perform the task
        transaction {
            create /completed_tasks/c.name with the result
            delete /queues/this.name/c.name
            delete /running_tasks/c.name
        }
}
```
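The worker's registration step might look like this in the Java client; the queue node is persistent so queued tasks survive the worker, while the presence node is ephemeral so its disappearance is the failure signal (the worker name is assumed to be unique, e.g., hostname plus PID):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class WorkerRegistration {
    static void register(ZooKeeper zk, String name) throws Exception {
        // Persistent: the master must still see queued tasks if we die.
        zk.create("/queues/" + name, new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Ephemeral: goes away with the session and signals our failure.
        zk.create("/workers/" + name, new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }
}
```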
Some notes on what I had in mind with this design. First, at any given time, no two tasks targeting the same computation were allowed to run. Therefore, I named the tasks after the computation they performed. So, if two different clients requested the same computation, only one would succeed, since only one would be able to create the `/pending_tasks` node. Likewise, if the task is already running, the creation of the `/running_tasks` node would fail and no new task would be dispatched.
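That deduplication falls straight out of ZooKeeper's create semantics, since a second create on the same path fails. A sketch:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class Client {
    // Returns true if this client won the right to run the computation.
    static boolean submit(ZooKeeper zk, String computation, byte[] details)
            throws Exception {
        try {
            zk.create("/pending_tasks/" + computation, details,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            return false; // same computation already requested
        }
    }
}
```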
Second, there might be arbitrary failures in both masters and workers, and no task will be lost. If a worker fails, the watch on delete events in `/workers` triggers the reassignment of its tasks. If a master fails, and any number of workers fail before a new master is in place, the start-up procedure moves tasks back to `/pending_tasks` and publishes any pending results.
Third, I might have forgotten some corner case, since I no longer have access to this PoC implementation. I'd be glad to discuss any issues.