I am writing a program to detect the sudden crash of a remote machine. The manager process runs on machine 1 and the worker process runs on machine 2. The manager process sends a message to the worker process by calling MPI_Isend, and the remote worker receives the message by calling MPI_Irecv. After each call, I check the return code to see whether there is a problem with MPI_COMM_WORLD. I also check the return code of MPI_Test, which runs after the send and receive calls.
However, the return code is always 0, even after I suddenly rebooted machine 2. MPI_Isend still returns 0 every time. Please give me some advice on how to detect a remote machine failure.
BTW, I did use the following statement:
MPI_Errhandler_set(MPI_COMM_WORLD,MPI_ERRORS_RETURN);