I have been having some issues with Slurm and Open MPI on a cluster. Whenever I run any job that uses `mpirun`, I get the following error:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

This issue arose suddenly, and the problem seems to be ubiquitous across the compute nodes.

Seemingly related, srun is also now failing, with the message:

srun: error: Task launch for <jobid> failed on node <nodename>: Job credential expired
srun: error: Application launch failed: Job credential expired
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

Thanks for any help anyone might have!

EDIT: Adding an example

If I run `mpirun hostname` on the head node, everything works fine. However, inside a Slurm allocation (`salloc`), running `mpirun hostname` produces the error above.

Alec Bills
    Write to your system administrators. – Hristo Iliev Feb 23 '21 at 19:41
  • I will be fixing this, not a sysadmin. Do you have any ideas on how to fix the issue or what the root cause might be? – Alec Bills Feb 23 '21 at 20:19
  • Try `srun hostname` within `salloc`. If it’s broken, then this is a SLURM issue. – Gilles Gouaillardet Feb 23 '21 at 22:50
  • @GillesGouaillardet I tried `srun hostname` from an salloc, and that results in the same error as in the post for `srun`. I restarted one of the nodes, and got a message saying that the slurmd failed to start, so I logged into that node and ran `systemctl start slurmd.service` and it started. After that, `systemctl status slurmd.service` indicates that the node daemon is working. – Alec Bills Feb 24 '21 at 03:10
  • so this is a pure SLURM issue (i.e. Open MPI is out of the picture for now). Did you check that all nodes are time synchronized? – Gilles Gouaillardet Feb 24 '21 at 03:58
  • I think this is probably the issue. The slurmctld.log is showing an error to that effect. Trying to figure out how to fix it now. – Alec Bills Feb 24 '21 at 15:34
  • Yup, that was the issue. Thanks! – Alec Bills Feb 24 '21 at 15:54
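
Since the resolution lives in the comments: the root cause was clock drift between nodes. Slurm job credentials are time-stamped (validated via MUNGE), so a skewed clock on a compute node can make a freshly issued credential look already expired. Below is a minimal sketch of a skew check between the head node and a compute node; the `check_skew` helper, the node name, and the 5-second tolerance are illustrative assumptions, not anything from the post.

```shell
#!/bin/sh
# Sketch: flag clock skew between a reference (head-node) timestamp and a
# remote (compute-node) timestamp, both in seconds since the epoch.
# check_skew and the tolerance value are illustrative, not a Slurm tool.
check_skew() {
    ref=$1; remote=$2; tol=$3
    diff=$((ref - remote))
    [ "$diff" -lt 0 ] && diff=$((0 - diff))   # absolute value
    if [ "$diff" -gt "$tol" ]; then
        echo "DRIFT ${diff}s"
    else
        echo "OK"
    fi
}

# In practice, gather the remote timestamp over ssh, e.g.:
#   check_skew "$(date +%s)" "$(ssh node01 date +%s)" 5
```

If a node reports drift, the usual fix is to point all nodes at the same NTP source (chrony or ntpd) and confirm sync before restarting slurmd.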

0 Answers