I have been having issues with Slurm and Open MPI on a cluster. Whenever I run any job that uses mpirun, I get the following error:
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
This issue arose suddenly, and it appears to affect every compute node.
Seemingly related, srun is now failing as well, with the message:
srun: error: Task launch for <jobid> failed on node <nodename>: Job credential expired
srun: error: Application launch failed: Job credential expired
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Thanks for any help anyone might have!
EDIT: Adding an example
If I run mpirun hostname on the head node, everything works fine. However, when I run mpirun hostname inside a Slurm allocation (salloc), I get the error.