I am trying to modify an MPI-based system to add fault tolerance (the job should keep running if machines go down).
I was thinking of using Apache ZooKeeper to handle machine failures. Is this the best way to proceed? Also, what happens to the MPI calls (send, receive, broadcast) when using ZooKeeper? Send/Recv calls in MPI are bound to a specific process rank (source/destination). In an environment where machines fail and may never come back, how would that work?
What performance drop should I expect from porting the existing application from MPI to a ZooKeeper-based solution?
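
To make the Send/Recv concern concrete, here is a toy Python sketch of what I mean. It does not call MPI or ZooKeeper at all: `RankAddressedComm` and `RoleRegistry` are hypothetical names I made up, and the "pick the lowest live rank" rule is only an assumption for illustration. The first class mimics sending to a fixed rank; the second mimics the kind of lookup I imagine a coordination service would provide, where a role is resolved to whichever process is currently alive.

```python
# Toy model only: no real MPI or ZooKeeper calls are made here.

class RankAddressedComm:
    """Mimics MPI-style addressing: every message names a fixed destination rank."""

    def __init__(self, size):
        self.alive = set(range(size))
        self.inbox = {rank: [] for rank in range(size)}

    def fail(self, rank):
        # Simulate a machine going down and never coming back.
        self.alive.discard(rank)

    def send(self, dest, msg):
        if dest not in self.alive:
            raise RuntimeError(f"rank {dest} is gone")
        self.inbox[dest].append(msg)


class RoleRegistry:
    """Hypothetical stand-in for a coordination service: role -> live rank."""

    def __init__(self, comm):
        self.comm = comm
        self.owner = {}

    def assign(self, role, rank):
        self.owner[role] = rank

    def resolve(self, role):
        rank = self.owner[role]
        if rank not in self.comm.alive:
            # Assumed failover rule: hand the role to the lowest live rank.
            rank = min(self.comm.alive)
            self.owner[role] = rank
        return rank


def send_to_role(comm, registry, role, msg):
    # Address the message by role instead of by a hard-coded rank.
    comm.send(registry.resolve(role), msg)
```

With this model, `comm.send(2, ...)` fails once rank 2 has gone down, whereas `send_to_role(comm, registry, "reducer", ...)` still delivers to a surviving process. What I do not understand is how the real MPI calls would be made to behave like the second case.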