El-cheapo way to monitor tasks in a cluster and restart if they crash (self-healing)?

Question

Consider a Linux cluster of N nodes. It needs to run M tasks. Each task can run on any node. Assume the cluster is up and working normally.

Question: what's the simplest way to monitor that the M tasks are running and, if a task exits abnormally (exit code != 0), start a new task on any of the machines that are up? Ignore network partitions.

Two of the M tasks have a dependency: 'm1' depends on 'm'. If task 'm' goes down, task 'm1' should be stopped. Then 'm' is restarted and, once it is up, 'm1' can be restarted. I can provide an orchestration script for this.

I eventually want to work up to Kubernetes, which does self-healing, but I'm not there yet.

Does task m need to complete before m1 can be started? Does each task consist of 'm' and 'm1'? — Thomas, Jul 28 '19 at 14:27
Task m only needs to start and finish its initialization. Then m1 can start at any time thereafter. — ecwdw 23e3e23e, Jul 30 '19 at 01:54

score 0 · Answer 1 · answered Jul 29 '19 at 10:48

The right (tm) way to do this is to set up a retry, potentially with some back-off strategy. There are many similar questions here on StackOverflow about how to do this - this is one of them.
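To make the retry-with-back-off idea concrete, here is a minimal plain-Python sketch. It is not Celery's own retry mechanism; the function name `run_with_retry` and its parameters are hypothetical, chosen only for illustration. It reruns a command until it exits with code 0, doubling the wait between attempts up to a cap.

```python
import subprocess
import time


def run_with_retry(cmd, max_attempts=5, base_delay=1.0, max_delay=60.0,
                   sleep=time.sleep):
    """Run cmd until it exits with code 0 or max_attempts is reached.

    Returns the number of attempts used on success and raises
    RuntimeError if every attempt fails. All names and defaults here
    are illustrative assumptions, not part of any library.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt == max_attempts:
            break
        # Exponential back-off: base_delay, 2x, 4x, ... capped at max_delay.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(delay)
    raise RuntimeError(f"{cmd!r} still failing after {max_attempts} attempts")
```

Something like `run_with_retry(["./my_task.sh"])` (a hypothetical script path) would then keep restarting the task on the local node; it does not by itself move the task to another machine.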

If you still want to do the monitoring and explicit task restart yourself, you can implement a service based on the task events that will do it for you. It is extremely simple, and proof of how brilliant Celery is. The service should handle the task-failed event. An example of how to do it is on the same page.

Thank you - will check out your links — ecwdw 23e3e23e, Jul 30 '19 at 01:54

score 0 · Answer 2 · answered Jul 30 '19 at 09:03

If you just need an initialization task to run for each computation task, you can use the Job concept along with an init container. Jobs are tasks that run just once until completion; Kubernetes will restart a Job's pod if it crashes. Init containers run before the actual pod containers are started and are used for initialization tasks: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
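As a rough illustration of the Job plus init container combination, the sketch below builds a Job manifest as a Python dict and prints it as JSON, which `kubectl apply -f -` accepts on standard input. The helper name `job_manifest`, the container names, the images, and the commands are all placeholders assumed for this example, not values from the question.

```python
import json


def job_manifest(name, init_image, init_command, task_image, task_command,
                 backoff_limit=4):
    """Build a Kubernetes Job manifest with one init container.

    The init container must finish successfully before the main task
    container starts. All argument values are caller-supplied placeholders.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # How many times Kubernetes retries before marking the Job failed.
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    # Jobs only allow "OnFailure" or "Never" here.
                    "restartPolicy": "OnFailure",
                    "initContainers": [
                        {"name": "init", "image": init_image,
                         "command": init_command},
                    ],
                    "containers": [
                        {"name": "task", "image": task_image,
                         "command": task_command},
                    ],
                }
            },
        },
    }


manifest = job_manifest(
    name="example-task",                      # hypothetical Job name
    init_image="busybox",                     # placeholder image
    init_command=["sh", "-c", "echo init"],   # placeholder init step
    task_image="busybox",                     # placeholder image
    task_command=["sh", "-c", "echo work"],   # placeholder task
)
print(json.dumps(manifest, indent=2))
```

Saved as, say, `make_job.py` (a hypothetical file name), the output could be piped with `python make_job.py | kubectl apply -f -`.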
