In several presentations (e.g., 1, 2, 3) on cluster management, one of the scheduler's stated objectives is to reduce correlated failures by spreading the tasks of a single job across computing nodes that are unlikely to fail together.
Why are correlated failures of tasks within a single job undesirable? If I understand correctly, all of a job's tasks need to finish before the job is complete. So at first glance it seems better for task failures to be concentrated in a small number of jobs, so that only those jobs suffer the delay of re-submitting failed tasks.
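To make my intuition concrete, here is a toy Monte Carlo sketch (the function name, parameters, and numbers are all my own illustrative assumptions, not anything from the talks; it assumes exactly one rack fails and tasks are placed uniformly at random):

```python
import random

# Toy model: N_JOBS jobs, each with TASKS_PER_JOB tasks, placed on N_RACKS
# racks. One random rack fails; a job is delayed iff it has at least one
# task on the failed rack.
N_JOBS, TASKS_PER_JOB, N_RACKS, TRIALS = 100, 50, 20, 1_000

def delayed_jobs(spread):
    """Average number of jobs delayed by one rack failure.
    spread=True:  each task lands on an independently random rack.
    spread=False: all tasks of a job are packed onto a single rack."""
    total = 0
    for _ in range(TRIALS):
        failed = random.randrange(N_RACKS)
        for _ in range(N_JOBS):
            if spread:
                racks = {random.randrange(N_RACKS) for _ in range(TASKS_PER_JOB)}
            else:
                racks = {random.randrange(N_RACKS)}
            total += failed in racks
    return total / TRIALS

print("packed :", delayed_jobs(spread=False))  # ~ N_JOBS / N_RACKS = 5 jobs delayed
print("spread :", delayed_jobs(spread=True))   # nearly all 100 jobs touch the failed rack
```

In this toy model, packing each job onto one rack limits the damage of a rack failure to roughly N_JOBS / N_RACKS jobs, while spreading means almost every job has a task on the failed rack and is delayed. That is exactly the intuition behind my question.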
I would understand it if all the tasks in a job were simply replicating the same work, but with hundreds of tasks per job that can't be the case (perhaps there are 3-4 identical replicas of each task for fault-tolerance purposes, and I do understand why it's important to avoid correlated failures within those replica groups).
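For those replica groups the arithmetic is clear (with numbers I've picked purely for illustration): if each node fails independently with probability p = 0.01, three replicas on separate nodes all fail with probability p^3 = 10^-6, whereas three replicas behind the same rack switch fail together whenever that switch fails, i.e. with probability on the order of p itself. My question is why the same spreading argument applies to ordinary, non-replicated tasks of one job.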