I am trying to run a simple MPI example on a cluster with multiple computing nodes. Now I am just using two test nodes, including gpu8
and gpu12
.
What I've done include:
gpu8
andgpu12
have the correct MPI environment (OpenMPI-4.0.1). I can successfully run the MPI example on a single node.- Passwordless login between
gpu8
andgpu12
has been setup. They can ssh to another node with no issues. - There is a hostfile on each node containing
gpu8
gpu12
- The executable files are under the same path.
echo $PATH
(on both nodes) gives
/home/user_1/share/local/openmpi-4.0.1/bin:xxxxxx
echo $LD_LIBRARY_PATH
(on both nodes) gives
/home/t716/shshi/share/local/openmpi-4.0.1/lib:
The ORTE problem:
I am running mpirun -np 2 --hostfile /home/user_2/hosts ./home/user_2/mpi-hello-world/mpi_hello_world
. The error output is:
bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------