
In several presentations on cluster management (e.g., 1, 2, 3), one of the scheduler's objectives is to reduce correlated failures by distributing the tasks of a single job across computing nodes that are less likely to fail together.

Why are correlated failures of tasks within a single job undesirable? If I understood correctly, all the tasks need to finish before the job is complete. So at first glance it seems better to confine task failures to a small number of jobs, so that only those jobs are delayed by re-submitting failed tasks.

I would understand if all the tasks in a job were simply replicating the same work, but with hundreds of tasks per job that can't be the case (perhaps there are 3-4 identical tasks for fault-tolerance purposes, and I do understand why it's important to reduce correlated failures within those groups of tasks).


1 Answer


I figured out what I missed: I had somehow pictured a job as splitting its work statically across a pre-determined set of tasks.

Actually, in the context of cluster management, work is split among tasks dynamically. Tasks are like workers: they announce their availability, say, to a load balancer, and are then dynamically assigned portions of the work.
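
For concreteness, here is a minimal sketch of this worker-pull model (my own illustration, not taken from any of the presentations): the job is just a queue of work chunks, and each task keeps pulling the next chunk until the queue is empty, so no task owns a fixed share of the job.

```python
import queue
import threading

# Hypothetical illustration: one "job" = a pile of work chunks in a shared queue;
# each "task" = a worker thread that repeatedly asks for the next chunk.
work = queue.Queue()
for chunk_id in range(100):          # 100 chunks of work in this job
    work.put(chunk_id)

results = []
lock = threading.Lock()

def task(worker_id: int) -> None:
    # A task has no pre-determined share of the job; it just keeps pulling
    # chunks until none are left (dynamic assignment).
    while True:
        try:
            chunk = work.get_nowait()
        except queue.Empty:
            return
        with lock:
            results.append((worker_id, chunk))

workers = [threading.Thread(target=task, args=(i,)) for i in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(f"{len(results)} chunks processed by {len(set(w for w, _ in results))} tasks")
```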

With this clarification, everything is obvious.

If a task fails, the load balancer simply re-allocates the corresponding work to the remaining tasks, at the cost of a slight deterioration in the job's performance metric (time to completion for a batch job; latency for a service job). However, if too many tasks of a single job fail at once, its performance degrades sharply. This is precisely why correlated failures are undesirable.
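
A hypothetical back-of-the-envelope model makes the asymmetry visible (the numbers, 100 chunks over 50 tasks, are made up for illustration): losing a few tasks barely moves the completion time, because the survivors absorb the re-queued work, but losing most of a job's tasks together slows that job down many times over.

```python
# Hypothetical toy model (numbers invented for illustration): a batch job with
# 100 unit-time chunks spread over 50 tasks. When a task fails, its chunks are
# re-queued and finished by the surviving tasks, so time to completion scales
# roughly with 1 / (number of surviving tasks).
def completion_time(total_chunks: int, tasks: int, failed_tasks: int) -> float:
    survivors = tasks - failed_tasks
    if survivors <= 0:
        return float("inf")          # every task of the job failed: the job stalls
    return total_chunks / survivors  # survivors split the re-queued work evenly

for failed in (0, 1, 5, 45, 50):
    t = completion_time(100, 50, failed)
    print(f"failed tasks: {failed:2d}  ->  time to completion ~ {t:.1f}")

# failed tasks:  0  ->  time to completion ~ 2.0
# failed tasks:  1  ->  time to completion ~ 2.0   (barely noticeable)
# failed tasks:  5  ->  time to completion ~ 2.2
# failed tasks: 45  ->  time to completion ~ 20.0  (correlated failure: ~10x slower)
# failed tasks: 50  ->  time to completion ~ inf
```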
