
In several presentations on cluster management (e.g., 1, 2, 3), one of the scheduler's objectives is to reduce correlated failures by distributing the tasks of a single job across computing nodes that are less likely to fail together.

Why are correlated failures of tasks within a single job undesirable? If I understood correctly, all the tasks need to finish before the job is complete. So at first glance it seems better to confine task failures to a small number of jobs, so that only those jobs are delayed by re-submitting failed tasks.

I would understand if all the tasks in a job were simply replicating the same work, but with hundreds of tasks per job that can't be the case (perhaps there are 3-4 identical tasks for fault-tolerance purposes, and I do understand why it's important to reduce correlated failures within those groups of tasks).


1 Answer


I figured out what I missed: I had somehow pictured a job as splitting its work statically across a pre-determined set of tasks.

Actually, in the context of cluster management, work is split among tasks dynamically. Tasks are like workers: they announce their availability, say, to a load balancer, and are then dynamically assigned portions of the work.
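
For concreteness, here is a minimal sketch of this worker-pull model (my own illustration, not taken from any of the presentations): the job is just a queue of work chunks, and each task keeps pulling the next chunk until the queue is empty, so no task owns a fixed share of the job.

```python
import queue
import threading

# Hypothetical illustration: one "job" = a pile of work chunks in a shared queue;
# each "task" = a worker thread that repeatedly asks for the next chunk.
work = queue.Queue()
for chunk_id in range(100):          # 100 chunks of work in this job
    work.put(chunk_id)

results = []
lock = threading.Lock()

def task(worker_id: int) -> None:
    # A task has no pre-determined share of the job; it just keeps pulling
    # chunks until none are left (dynamic assignment).
    while True:
        try:
            chunk = work.get_nowait()
        except queue.Empty:
            return
        with lock:
            results.append((worker_id, chunk))

workers = [threading.Thread(target=task, args=(i,)) for i in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(f"{len(results)} chunks processed by {len(set(w for w, _ in results))} tasks")
```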

With this clarification, everything is obvious.

If a task fails, the load balancer simply re-allocates the corresponding work to the remaining tasks, at the cost of a slight deterioration in the job's performance metric (time to completion for a batch job; latency for a service job). However, if too many tasks of a single job fail at once, its performance degrades sharply. This is precisely why correlated failures are undesirable.
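
A hypothetical back-of-the-envelope model makes the asymmetry visible (the numbers, 100 chunks over 50 tasks, are made up for illustration): losing a few tasks barely moves the completion time, because the survivors absorb the re-queued work, but losing most of a job's tasks together slows that job down many times over.

```python
# Hypothetical toy model (numbers invented for illustration): a batch job with
# 100 unit-time chunks spread over 50 tasks. When a task fails, its chunks are
# re-queued and finished by the surviving tasks, so time to completion scales
# roughly with 1 / (number of surviving tasks).
def completion_time(total_chunks: int, tasks: int, failed_tasks: int) -> float:
    survivors = tasks - failed_tasks
    if survivors <= 0:
        return float("inf")          # every task of the job failed: the job stalls
    return total_chunks / survivors  # survivors split the re-queued work evenly

for failed in (0, 1, 5, 45, 50):
    t = completion_time(100, 50, failed)
    print(f"failed tasks: {failed:2d}  ->  time to completion ~ {t:.1f}")

# failed tasks:  0  ->  time to completion ~ 2.0
# failed tasks:  1  ->  time to completion ~ 2.0   (barely noticeable)
# failed tasks:  5  ->  time to completion ~ 2.2
# failed tasks: 45  ->  time to completion ~ 20.0  (correlated failure: ~10x slower)
# failed tasks: 50  ->  time to completion ~ inf
```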
