
There is an application that is started with mpirun. If a compute node fails, all processes crash, but if only the head node fails (for example, it reboots), the processes are left stuck on the compute nodes. How can I get rid of these zombie processes automatically?

Severgun
  • Are you launching the job via a scheduler (Torque/Slurm) or just with mpirun? Usually the scheduler should take care of the cleanup. If not, maybe look at MCA variables like orte_abort_on_non_zero_status. – Tux_DEV_NULL Nov 07 '17 at 15:03
  • I'm launching with Torque, but it behaves as I described above. If the head node dies, Torque deletes the job from the queue, but the processes on the compute nodes stay stuck. – Severgun Nov 08 '17 at 06:26
  • Do you see any error messages, either MPI errors or anything in the Torque logs? Also look at the orte-clean tool: https://www.open-mpi.org/doc/v2.0/man1/orte-clean.1.php – Tux_DEV_NULL Nov 08 '17 at 08:15
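
As a rough illustration of the kind of cleanup being discussed, here is a minimal Python sketch that could run on a compute node to find leftover processes by command name and owner. It is not how Torque or Open MPI clean up, and it is not a replacement for orte-clean. It assumes a Linux /proc filesystem and that no other job of the same user is still running on the node. The function names (parse_stat, find_processes, terminate) and the example name "solver" are hypothetical.

```python
import os
import signal
from pathlib import Path


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, ppid)."""
    # comm is wrapped in parentheses and may contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    fields = text[right + 1:].split()
    ppid = int(fields[1])  # fields[0] is the state, fields[1] is the parent PID
    return pid, comm, ppid


def find_processes(name, uid, proc_root="/proc"):
    """Return sorted PIDs of processes named `name` and owned by `uid`.

    Note: the kernel truncates comm to 15 characters, so `name` must be
    the truncated form for long executable names.
    """
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            if entry.stat().st_uid != uid:
                continue
            pid, comm, _ppid = parse_stat((entry / "stat").read_text())
        except (OSError, ValueError):
            continue  # the process exited while we were scanning
        if comm == name:
            pids.append(pid)
    return sorted(pids)


def terminate(pids, sig=signal.SIGTERM):
    """Send `sig` to each PID, ignoring processes that are already gone."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass
```

A cleanup script might call terminate(find_processes("solver", os.getuid())) once the job has been removed from the queue. Whether that is safe depends on the assumption above that the user has nothing else running on that node.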

0 Answers