I have been having some issues with Slurm and Open MPI on a cluster. Whenever I run any job that uses `mpirun`, I get the following error:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

This issue arose suddenly, and the problem seems to be ubiquitous across the compute nodes.

Seemingly related, srun is also now failing, with the message:

srun: error: Task launch for <jobid> failed on node <nodename>: Job credential expired
srun: error: Application launch failed: Job credential expired
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

Thanks for any help anyone might have!

EDIT: Adding an example

If I run `mpirun hostname` on the head node, everything works fine. However, inside a Slurm allocation (`salloc`), running `mpirun hostname` produces the error above.

Alec Bills
    Write to your system administrators. – Hristo Iliev Feb 23 '21 at 19:41
  • I will be fixing this, not a sysadmin. Do you have any ideas on how to fix the issue or what the root cause might be? – Alec Bills Feb 23 '21 at 20:19
  • Try `srun hostname` within `salloc`. If it’s broken, then this is a SLURM issue. – Gilles Gouaillardet Feb 23 '21 at 22:50
  • @GillesGouaillardet I tried `srun hostname` from an salloc, and that results in the same error as in the post for `srun`. I restarted one of the nodes, and got a message saying that the slurmd failed to start, so I logged into that node and ran `systemctl start slurmd.service` and it started. After that, `systemctl status slurmd.service` indicates that the node daemon is working. – Alec Bills Feb 24 '21 at 03:10
  • so this is a pure SLURM issue (i.e. Open MPI is out of the picture for now). Did you check that all nodes are time synchronized? – Gilles Gouaillardet Feb 24 '21 at 03:58
  • I think this is probably the issue. The slurmctld.log is showing an error to that effect. Trying to figure out how to fix it now. – Alec Bills Feb 24 '21 at 15:34
  • Yup, that was the issue. Thanks! – Alec Bills Feb 24 '21 at 15:54
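
Since the resolution lives in the comments: the root cause was clock drift between nodes. Slurm job credentials are time-stamped (validated via MUNGE), so a skewed clock on a compute node can make a freshly issued credential look already expired. Below is a minimal sketch of a skew check between the head node and a compute node; the `check_skew` helper, the node name, and the 5-second tolerance are illustrative assumptions, not anything from the post.

```shell
#!/bin/sh
# Sketch: flag clock skew between a reference (head-node) timestamp and a
# remote (compute-node) timestamp, both in seconds since the epoch.
# check_skew and the tolerance value are illustrative, not a Slurm tool.
check_skew() {
    ref=$1; remote=$2; tol=$3
    diff=$((ref - remote))
    [ "$diff" -lt 0 ] && diff=$((0 - diff))   # absolute value
    if [ "$diff" -gt "$tol" ]; then
        echo "DRIFT ${diff}s"
    else
        echo "OK"
    fi
}

# In practice, gather the remote timestamp over ssh, e.g.:
#   check_skew "$(date +%s)" "$(ssh node01 date +%s)" 5
```

If a node reports drift, the usual fix is to point all nodes at the same NTP source (chrony or ntpd) and confirm sync before restarting slurmd.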

0 Answers