
I'm writing unit tests with Catch2 for some code that uses MPI. A failed test in Catch2 is basically a failed assertion with helpful error messages. Now, in some cases, mpirun doesn't seem to detect when a test fails on just one process. Then a deadlock can occur.

Is there a more or less elegant way to check, at a specific point in the code, whether all processes are still alive? If not, all processes should terminate.

RL-S
  • You can always ask the nodes to return a signal, such as their ranks, to node 0. If any of them do not respond, then you know something has gone wrong and all processes should terminate. – stackoverblown Jun 23 '20 at 17:33
  • Well... What would you use for that? Any collective communication would result in a deadlock again, so that wouldn't work. Do you mean loop over the ranks and use something like good (bad?) old `MPI_Send` and `MPI_Recv`? – RL-S Jun 23 '20 at 20:40
  • 1
    replace `assert(a)` with `if (!a) MPI_Abort(1, MPI_COMM_WORLD)` – Gilles Gouaillardet Jun 23 '20 at 23:56
  • @GillesGouaillardet That sounds like a promising path, thank you. I'll try to find out how I can manipulate the source of my testing library to do just that. Hopefully, a `#define` right before inclusion of the library can do the trick. It is header-only, after all (see the sketch below this comment thread). – RL-S Jun 24 '20 at 12:02
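
Picking up on the suggestion in the comments, here is a minimal sketch of what such a replacement check could look like. The macro name `MPI_REQUIRE` is made up for illustration and is not part of MPI or Catch2; the point is simply that the failing rank calls `MPI_Abort` so the launcher tears down every rank instead of leaving the others stuck in a collective call.

```cpp
#include <mpi.h>
#include <cstdio>

// Hypothetical assert() replacement (MPI_REQUIRE is a made-up name):
// on failure, print a diagnostic and abort the entire MPI job so that
// no other rank is left waiting in a collective call.
#define MPI_REQUIRE(cond)                                          \
    do {                                                           \
        if (!(cond)) {                                             \
            std::fprintf(stderr, "check failed: %s (%s:%d)\n",     \
                         #cond, __FILE__, __LINE__);               \
            MPI_Abort(MPI_COMM_WORLD, 1);                          \
        }                                                          \
    } while (0)

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Fails on rank 1 only; MPI_Abort then terminates every rank
    // instead of leaving the others stuck in the barrier below.
    MPI_REQUIRE(rank != 1);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```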

1 Answer


Most MPI launchers terminate the whole MPI job if they notice any rank exiting prematurely without calling MPI_Finalize(). If the ranks whose tests fail exit normally and clean up the MPI environment by calling MPI_Finalize(), the launcher will happily keep executing the rest of the job. If you want a single failing rank to reliably terminate the whole MPI job, it must call MPI_Abort(MPI_COMM_WORLD, errcode) before calling MPI_Finalize(). errcode is an error code which, depending on the MPI implementation, may become the process exit code of the MPI launcher (mpirun in your case). I am not familiar with Catch2, so you should investigate how and where to incorporate the call to MPI_Abort.
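
One possible place to incorporate it is a custom main(). The sketch below assumes Catch2 v2's CATCH_CONFIG_RUNNER mechanism, where `Catch::Session().run()` returns the number of failed test cases on the local rank, so treat it as a starting point rather than a verified recipe; it also only helps if the failing rank actually reaches the end of its test run.

```cpp
// Sketch of a custom test runner; assumes Catch2 v2 with CATCH_CONFIG_RUNNER.
#define CATCH_CONFIG_RUNNER
#include <catch2/catch.hpp>
#include <mpi.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    // Run all tests on this rank; the return value is the number of
    // failed test cases (possibly clamped by Catch2).
    const int num_failed = Catch::Session().run(argc, argv);

    if (num_failed > 0) {
        // Bring down the whole MPI job so no other rank keeps waiting.
        MPI_Abort(MPI_COMM_WORLD, num_failed);
    }

    MPI_Finalize();
    return num_failed;
}
```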

If you want a failing test in one rank to fail the same test in all other ranks and then let the next test proceed, that is an entirely different story. The problem is that if one or more ranks fail a test and stop participating in the communication while the remaining ranks keep running, the program may reach a point where a communication operation involves the failed ranks; that operation will block and the result is a deadlock. MPI has no user-visible timeout mechanism and no way to gracefully interrupt blocking calls unless you make explicit use of non-blocking operations, so aborting the whole job is your best option.
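
If all ranks do still reach a common checkpoint between tests, one way to propagate a local failure is to combine a pass/fail flag with a collective reduction and abort (or skip the remaining tests) when any rank reports failure. This is only a sketch with a made-up helper name, and it still deadlocks if a failed rank never reaches the checkpoint at all:

```cpp
#include <mpi.h>
#include <cstdlib>

// Hypothetical checkpoint helper (the name is made up): every rank
// reports whether its tests have passed so far; if any rank failed,
// all ranks abort together instead of deadlocking later.
void check_all_ranks_ok(bool locally_ok) {
    int local = locally_ok ? 1 : 0;
    int global = 0;

    // MPI_LAND: 'global' is 1 only if every rank reported success.
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_LAND, MPI_COMM_WORLD);

    if (!global) {
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }
}
```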

Hristo Iliev