Suppose that I run an MPI program involving 25 processes on 25 different machines. The program is initiated at one of them called the "master" with a command like

mpirun -n 25 --hostfile myhostfile.txt python helloworld.py

This is executed on Linux from a bash script and uses mpi4py. Sometimes, in the middle of execution, I want to stop the program on all machines. I don't care whether this is done gracefully or not, since the data I might need is already saved.

Usually, I press Ctrl + C in the terminal of the "master", and I think it works as described above. Is this true? In other words, will it stop this specific MPI program on all machines?

Another method I tried is to get the PID of the process on the "master" and kill it. I am not sure whether this works either.

Do the above methods work as described? If not, what else do you suggest? Note that I want to avoid using MPI calls such as MPI_Abort for this purpose, which some other discussions here and here suggest.

mgus
  • When you send SIGINT to `mpirun` by either pressing Ctrl+C or by targeting the signal to its PID, e.g., via `kill -INT ...`, it catches the signal and uses some underlying mechanism, specific to the MPI implementation, to send a kill signal to all the MPI ranks in the job. – Hristo Iliev Aug 19 '20 at 07:41
  • Just out of curiosity, is this still true if one sends `SIGKILL` instead of `SIGINT`? – mgus Aug 19 '20 at 20:38
  • `SIGKILL` cannot be caught and using it will have implementation-dependent bad effects on the MPI job. Most likely the job will continue to run until the ranks notice that they can no longer communicate with `mpirun`. See [here](https://s3.amazonaws.com/revue/items/images/001/678/707/mail/dont-sigkill-2.png). – Hristo Iliev Aug 19 '20 at 21:27
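
For illustration, the approach from the comments (signal the launcher with SIGINT rather than SIGKILL) could be scripted roughly as below. This is only a sketch: `stop_job` and `grace_seconds` are hypothetical names, and the fallback to SIGTERM assumes the launcher also traps that signal, which depends on the MPI implementation. SIGKILL is deliberately never sent, for the reason given above.

```python
import signal
import subprocess


def stop_job(proc, grace_seconds=10.0):
    """Ask a launcher process (e.g. mpirun) to shut the whole job down.

    `proc` is a subprocess.Popen handle for the launcher. Returns the
    launcher's exit code.
    """
    # Nothing to do if the launcher has already exited.
    if proc.poll() is not None:
        return proc.returncode

    # Same signal that Ctrl+C delivers; the launcher can catch it and
    # forward a termination request to the ranks.
    proc.send_signal(signal.SIGINT)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # Assumption: the launcher handles SIGTERM as well. It is still a
        # catchable signal, unlike SIGKILL, so cleanup remains possible.
        proc.send_signal(signal.SIGTERM)
        return proc.wait()
```

With the command from the question, this might be used as `proc = subprocess.Popen(["mpirun", "-n", "25", "--hostfile", "myhostfile.txt", "python", "helloworld.py"])` followed later by `stop_job(proc)`. If only the PID is known, `os.kill(pid, signal.SIGINT)` sends the same signal.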
