How to add fault tolerance to an existing MPI-based system so that it keeps running after a machine goes down?

Question

I am trying to modify an MPI-based system to add fault tolerance: the job should keep running if machines go down.

I was thinking of using Apache ZooKeeper to handle machine failures. Is that the best way to proceed? Also, what happens to the MPI calls (such as send, receive, and broadcast) when using ZooKeeper? Send/Recv calls in MPI are typically bound to a machine ID (the source/destination rank). In an environment where machines fail and may never come back, how would that work?

How much performance would be lost by porting the existing application from MPI to a ZooKeeper-based solution?
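To make the addressing concern concrete, here is a minimal sketch of one possible indirection: senders address a logical rank, and a registry resolves that rank to whichever host currently owns it. Everything in it (`RankTable`, `mark_failed`, `send`, the host names) is hypothetical and purely in-memory. It stands in for state a coordination service might hold and calls neither MPI nor ZooKeeper.

```python
from dataclasses import dataclass, field


class NoLiveHostError(RuntimeError):
    """Raised when no spare host is left to take over a logical rank."""


@dataclass
class RankTable:
    """Hypothetical registry mapping logical ranks to live hosts.

    In-memory stand-in only: a real system would keep this state in a
    coordination service and would need a failure detector to update it.
    """
    assignment: dict[int, str]                       # logical rank -> host name
    spares: list[str] = field(default_factory=list)  # idle replacement hosts

    def resolve(self, rank: int) -> str:
        """Return the host that currently owns the given logical rank."""
        return self.assignment[rank]

    def mark_failed(self, host: str) -> list[int]:
        """Move every rank owned by `host` to a spare; return the ranks moved."""
        moved = []
        for rank, current in sorted(self.assignment.items()):
            if current != host:
                continue
            if not self.spares:
                raise NoLiveHostError(f"no spare host for rank {rank}")
            self.assignment[rank] = self.spares.pop(0)
            moved.append(rank)
        return moved


def send(table: RankTable, dest_rank: int, payload: str) -> tuple[str, str]:
    """Pretend send: report which host would receive the payload.

    No network traffic happens; the point is that the caller names a
    logical rank and never a physical machine.
    """
    return (table.resolve(dest_rank), payload)


if __name__ == "__main__":
    table = RankTable({0: "node-a", 1: "node-b"}, spares=["node-c"])
    print(send(table, 1, "hello"))   # resolved before the failure
    table.mark_failed("node-b")      # rank 1 is handed to a spare
    print(send(table, 1, "hello"))   # same logical rank, different host
```

The sketch covers addressing only. Any state the failed process held would still have to be restored from a checkpoint or recomputed, which this example does not attempt.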

Fault tolerance is still not part of the MPI standard. With most MPI implementations, if even one process dies for any reason, the whole MPI job is aborted. - Hristo Iliev, Jul 10 '15 at 18:05
Yes, I am aware of that, which is why I am looking at other options such as ZooKeeper/Curator. - JhnElaine, Jul 11 '15 at 00:36
@HristoIliev That's not accurate. All MPI functions return error codes. If you set your own error handler, you can write fault-tolerant code in some scenarios. It's just that most _implementations_ don't do a good job of supporting this. - Jeff Hammond, Jul 11 '15 at 01:15
@Jeff, given that according to the MPI standard: _"After an error is detected, the state of MPI is undefined. That is, using a user-defined error handler, or MPI_ERRORS_RETURN, does not necessarily allow the user to continue to use MPI after an error is detected. The purpose of these error handlers is to allow a user to issue user-defined error messages and to take actions unrelated to MPI (such as flushing I/O buffers) before a program exits."_ Writing _portable_ fault-tolerant MPI code that continues running after a node failure is currently quite problematic. - Hristo Iliev, Jul 11 '15 at 20:37

0 Answers