
I am writing a program to detect the sudden crash of a remote machine. The manager process runs on machine 1 and the worker process runs on machine 2. The manager sends a message to the worker by calling MPI_Isend, and the worker receives it by calling MPI_Irecv. After each call, I check the return code to see whether there is a problem with MPI_COMM_WORLD. I also check the return code of MPI_Test, which runs after the send and receive calls.

Somehow, the return code is always 0, even after I abruptly rebooted machine 2. MPI_Isend always returns 0. Please give me some advice on how to detect remote machine failure.

BTW, I did use the following statement:

MPI_Errhandler_set(MPI_COMM_WORLD,MPI_ERRORS_RETURN);
  • Which MPI library are you using? Fault tolerance is a non-standard feature at the moment, so it will depend heavily on whether you're using MPICH or Open MPI. – Wesley Bland May 02 '14 at 12:14
  • We are using MPICH, either 2 or 3. We are trying to implement some basic fault tolerance features until Argonne rolls out fault tolerance in the future. Thanks! – user3595139 May 02 '14 at 17:44
  • There is some fault tolerance in those versions: if you add the flag `--disable-auto-cleanup`, it will return errors to you. On the other hand, if you're not seeing failures at all, that's weird and I'm not sure what's going on. – Wesley Bland May 02 '14 at 19:03
  • FYI, Argonne is working on better FT for a future release. Hopefully that will be out within a few months. – Wesley Bland May 02 '14 at 19:04
  • I used `--disable-auto-cleanup`, with the same result. On the manager, the calls are `rc = MPI_Isend(...)` followed by `rc = MPI_Test(...)`. On the workers, the calls are `rc = MPI_Irecv(...)` followed by `rc = MPI_Test(...)`. I did check the return codes, and they always indicate success. While the program was running, I rebooted one of the nodes and got this message: "Connection to wb201.qa2.ch3.qa.i.com closed by remote host." As you can see, one of the workers is gone, and the system printed a warning message. However, the manager process still sends messages to that process and gets a success return code back. Thank you! – user3595139 May 05 '14 at 20:46
  • Ah. There's no guarantee that short sends will return an error. It's possible that the messages are sent eagerly, which does not require a round trip. This means that if you need to know whether there was a failure, you'll need to force communication via something like an `MPI_BARRIER`. – Wesley Bland May 06 '14 at 02:46

1 Answer


I probably should have turned this into an answer long ago to make it easier for others to track down.


As has been discussed in other posts, completion of MPI_Send and friends does not necessarily indicate that a message has been received on the other end. Only MPI_Ssend implies any sort of remote completion, and even that only indicates that the receiver has started receiving the message into its buffer.

For this particular problem, MPI_Ssend would probably be enough as it would indicate that a failure has occurred, though it would slow things down.

In the end, you can't rely on sender-side semantics to tell you that a failure has occurred without doing extra work in MPI. There are no guarantees built into the standard to do so because they would be expensive. If you must know quickly on the sender side, use MPI_Ssend. Otherwise, do a bunch of operations, then do something synchronizing later (like an MPI_Ssend, or an MPI_Barrier if you want to validate all processes at once).

Wesley Bland