Open MPI's mpiexec kills the remaining ranks by first sending them SIGTERM and then SIGKILL (should any of them survive SIGTERM). Neither of those signals results in a core dump. You could probably install a signal handler for SIGTERM that calls abort(3) in order to force core dumps on kill.
Here is some sample code that works with Open MPI 1.6.5:
    #include <stdlib.h>
    #include <signal.h>
    #include <mpi.h>

    void term_handler (int sig) {
        // Restore the default SIGABRT disposition
        signal(SIGABRT, SIG_DFL);
        // Abort (dumps core)
        abort();
    }

    int main (int argc, char **argv) {
        int rank;

        MPI_Init(&argc, &argv);
        // Override the SIGTERM handler AFTER the call to MPI_Init
        signal(SIGTERM, term_handler);

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        // Cause division-by-zero exception in rank 0
        rank = 1 / rank;
        // Make other ranks wait for rank 0
        MPI_Bcast(&rank, 1, MPI_INT, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }
Open MPI's MPI_Init installs special handlers for some known signals that either print useful debug information or generate backtrace files (.btr). That's why the SIGTERM handler has to be installed after the call to MPI_Init and the default action of SIGABRT (used by abort(3)) has to be restored before calling abort().
Note that the signal handler will appear at the top of the call stack in the core file:
(gdb) bt
#0 0x0000003bfd232925 in raise () from /lib64/libc.so.6
#1 0x0000003bfd234105 in abort () from /lib64/libc.so.6
#2 0x0000000000400dac in term_handler (sig=15) at test.c:8
#3 <signal handler called>
#4 0x00007fbac7ad0bc7 in mca_btl_sm_component_progress () from /path/libmpi.so.1
#5 0x00007fbac7c9fca7 in opal_progress () from /path/libmpi.so.1
...
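If you want to convince yourself that the SIGTERM-to-abort trick works before involving MPI at all, something like the following sketch might help. It is not part of the Open MPI sample above: the helper name signal_that_killed_child is made up for illustration, and the fork/pipe scaffolding only exists so that a parent process can play the role of mpiexec sending SIGTERM. It assumes a POSIX system.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Same idea as term_handler in the MPI sample: turn SIGTERM into abort(). */
static void term_handler(int sig) {
    (void)sig;
    signal(SIGABRT, SIG_DFL);  /* make sure abort() takes the default action */
    abort();
}

/* Hypothetical helper (illustration only): forks a child that installs the
 * handler, sends it SIGTERM, and returns the number of the signal that
 * actually terminated the child, or -1 on error or normal exit. */
int signal_that_killed_child(void) {
    int ready[2];
    if (pipe(ready) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(ready[0]);
        close(ready[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: install the handler, then tell the parent we are ready. */
        close(ready[0]);
        signal(SIGTERM, term_handler);
        if (write(ready[1], "x", 1) != 1)
            _exit(1);
        for (;;)
            pause();  /* wait until a signal arrives */
    }

    /* Parent: wait for the handler to be installed before sending SIGTERM. */
    close(ready[1]);
    char c;
    int child_ready = (read(ready[0], &c, 1) == 1);
    close(ready[0]);
    if (child_ready)
        kill(pid, SIGTERM);

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!child_ready || !WIFSIGNALED(status))
        return -1;
    return WTERMSIG(status);
}
```

Calling signal_that_killed_child() from a small main() should return SIGABRT rather than SIGTERM, which is what makes the kernel consider the process for a core dump. Whether a core file is actually written still depends on the core size limit in effect (e.g. ulimit -c) and on the system's core dump settings.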
That said, I would rather recommend using a parallel debugger such as TotalView or DDT, if you have one at your disposal.