I get the following error when trying to submit a job with sbatch:
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
When I use sbatch with no parameters it runs fine, but when I pass any parameter (e.g. --job-name or --export) to sbatch, the above error appears.
I am using Open MPI 3 and running a Python script with mpirun. Both mpirun and orted appear to come from the same Open MPI installation, as shown by calling which in my Slurm script right before the mpirun line:
which mpirun: /opt/openmpi30/bin/mpirun
which orted: /opt/openmpi30/bin/orted
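The check described above could look roughly like the following batch script. This is a sketch, not the original script: the job name, the helper function report_tool, and the file name my_script.py are placeholders assumed for illustration. It prints where mpirun and orted resolve on the node's PATH right before launching, in the same "which NAME: PATH" form as the output above.

```shell
#!/bin/bash
#SBATCH --job-name=mpi_check   # hypothetical job name, not from the original post

# Print where a command resolves on the current PATH, or "not found".
report_tool() {
    tool_path=$(command -v "$1") || tool_path="not found"
    echo "which $1: $tool_path"
}

# Confirm that the launcher and its daemon come from the same installation.
report_tool mpirun
report_tool orted

# Launch only if mpirun is available; my_script.py is a placeholder name.
if command -v mpirun >/dev/null 2>&1; then
    mpirun python my_script.py
fi
```

Running the same script once with plain sbatch and once with an extra option such as --job-name makes it possible to compare the two sets of printed paths directly.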
Any help would be greatly appreciated.