I use MPICH2. When I launch processes with mpiexec, the failure of one process crashes all the other processes. How can I avoid this?

dodolong

  • Don't avoid it! That is the desired behaviour in 99.9% of the cases. Why would you want otherwise? – Gilles Jul 22 '16 at 06:28
  • We want to implement fault recovery: if one process crashes, we just restart that one. – dodolong Jul 22 '16 at 06:49
  • Well, you can't since MPI doesn't support it. Fault tolerance has been a topic of research in the MPI community for decades, and was expected to land in MPI 3.0, which it didn't. Maybe for MPI 4.0... – Gilles Jul 22 '16 at 06:52
  • Your question is a little generic; there's an overview of recent efforts here: http://stackoverflow.com/a/23919726/491687 – Wesley Bland Jul 22 '16 at 18:53

1 Answer

In MPICH, there is a flag called -disable-auto-cleanup which will prevent the process manager from automatically cleaning up all processes when a single process fails.

However, MPI itself does not have much support for fault tolerance. The Fault Tolerance Working Group is working on adding it in a future version of the MPI Standard.

For now, the best you can do is change the default MPI Error Handler away from MPI_ERRORS_ARE_FATAL, which causes all processes to abort, to something else like MPI_ERRORS_RETURN which would return the error code to the application and allow it to do something else. However, you're not likely to be able to communicate anymore after a failure has occurred, especially if you are trying to use collective communication.

Wesley Bland