I use MPICH2. When I launch processes with mpiexec, the failure of one process crashes all the other processes. How can I avoid this?
-
Don't avoid it! That is the desired behaviour in 99.9% of the cases. Why would you want otherwise? – Gilles Jul 22 '16 at 06:28
-
We want to implement fault recovery: when one process crashes, we just restart that one. – dodolong Jul 22 '16 at 06:49
-
Well, you can't since MPI doesn't support it. Fault tolerance has been a topic of research in the MPI community for decades, and was expected to land in MPI 3.0, which it didn't. Maybe for MPI 4.0... – Gilles Jul 22 '16 at 06:52
-
Your question is a little generic; there's an overview of recent efforts here: http://stackoverflow.com/a/23919726/491687 – Wesley Bland Jul 22 '16 at 18:53
1 Answer
In MPICH, there is an mpiexec flag called -disable-auto-cleanup which prevents the process manager from automatically cleaning up all processes when a single process fails.
However, MPI itself does not have much support for fault tolerance, and this is something that the Fault Tolerance Working Group is working on adding in a future version of the MPI Standard.
For now, the best you can do is change the default MPI error handler away from MPI_ERRORS_ARE_FATAL, which causes all processes to abort, to something else like MPI_ERRORS_RETURN, which returns the error code to the application and allows it to do something else. However, you're not likely to be able to communicate anymore after a failure has occurred, especially if you are trying to use collective communication.
