Most MPI launchers terminate the whole MPI job if they notice any one rank exiting prematurely without calling `MPI_Finalize()`. If the ranks that fail a test all exit normally and clean up the MPI environment by calling `MPI_Finalize()`, the launcher will happily continue executing the rest of the job. If you want one failing rank to reliably terminate the whole MPI job, it must call `MPI_Abort(MPI_COMM_WORLD, something)` before calling `MPI_Finalize()`. `something` is an error code which, depending on the MPI implementation, may become the process exit code of the MPI launcher (`mpirun` in your case). I am not familiar with Catch2, so you should investigate how and where to incorporate the call to `MPI_Abort`.
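
One possible place to hook this in, if you supply your own `main()` and drive the tests through Catch2's `Session` API, is to check the session's return value (which is non-zero when tests failed) and abort from there. This is only a sketch of one integration point, not the definitive way to wire it up:

```cpp
// Sketch: custom Catch2 (v3) main that initializes MPI, runs the test
// session, and aborts the entire MPI job if any test failed on this rank.
#include <catch2/catch_session.hpp>
#include <mpi.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    // run() returns a non-zero value when at least one test failed.
    int result = Catch::Session().run(argc, argv);

    if (result != 0) {
        // Tear down the whole MPI job; depending on the MPI implementation,
        // 'result' may surface as the exit code of mpirun.
        MPI_Abort(MPI_COMM_WORLD, result);
    }

    MPI_Finalize();
    return result;
}
```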
If you want a failing test in one rank to fail the same test in all other ranks and then allow the next test to proceed, that is an entirely different story. The problem is that if one or more ranks fail a test and stop participating in the communication while the rest of the ranks continue running, the program may reach a point where some communication involving the failed ranks blocks and consequently deadlocks. MPI has no user-visible timeout mechanism and no way to gracefully interrupt blocking calls unless you make explicit use of non-blocking operations, so aborting the whole job is your best option.
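
To illustrate the deadlock, here is a contrived sketch (the test body and flag are hypothetical, not part of your code): if one rank bails out of a test early, it never reaches the collective call that the surviving ranks are blocked in.

```cpp
#include <mpi.h>

// Hypothetical test body: one rank fails an assertion and returns early.
void hypothetical_test_body(bool this_rank_failed_an_assertion) {
    if (this_rank_failed_an_assertion) {
        // This rank stops participating in the test's communication...
        return;
    }
    // ...while the other ranks block here forever, since MPI_Barrier only
    // returns once every rank in the communicator has entered it.
    MPI_Barrier(MPI_COMM_WORLD);
}
```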