I'm having trouble running an OpenMPI program using only two nodes (one of the nodes is the same machine that is executing the mpiexec command and the other node is a separate machine).

I'll call the machine that is running mpiexec, master, and the other node slave.

On both master and slave, I've installed OpenMPI in my home directory under ~/mpi.

I have a file called ~/machines.txt on master.

Ideally, ~/machines.txt should contain:

master
slave

However, when I run the following on master:

mpiexec -n 2 --hostfile ~/machines.txt hostname

I get the following error:

bash: orted: command not found

But if ~/machines.txt contains only the name of the node that the command is running on, it works. ~/machines.txt:

master

Command:

mpiexec -n 2 --hostfile ~/machines.txt hostname

OUTPUT:

master
master

I've tried running the same command on slave, and changed the machines.txt file to contain only slave, and it worked too. I've made sure that my .bashrc file contains the proper paths for OpenMPI.

What am I doing wrong? In short, there is only a problem when I try to execute a program on a remote machine, but I can run mpiexec perfectly fine on the machine that is executing the command. This makes me believe that it's not a path issue. Am I missing a step in connecting both machines? I have passwordless ssh login capability from master to slave.

clarity
  • If you installed MPI under `~/mpi`, then I am guessing you have added `~/mpi` to your `PATH` inside `.bashrc` or something. Do not assume that `.bashrc` is loaded on each machine that MPI is run. – merlin2011 Apr 08 '14 at 00:55
  • Yes, I added bin to PATH and lib to LD_LIBRARY_PATH on both machines. – clarity Apr 08 '14 at 20:30

4 Answers

3

This error message means that you either do not have Open MPI installed on the remote machine, or you do not have your PATH set properly on the remote machine for non-interactive logins (i.e., such that it can't find the installation of Open MPI on the remote machine). "orted" is one of the helper executables that Open MPI uses to launch processes on remote nodes -- so if "orted" was not found, then it didn't even get to the point of trying to launch "hostname" on the remote node.

Note that there might be a difference between interactive and non-interactive logins in your shell startup files (e.g., in your .bashrc).
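One way to check this is to ask a non-interactive ssh shell where it finds orted. The sketch below is only an illustration: `check_remote_orted` is a hypothetical helper name, and it assumes passwordless ssh to the host and a startup file that prints nothing of its own (an echo in .bashrc would end up in the reported path).

```shell
# Hypothetical helper (not part of Open MPI): report where, or whether,
# a non-interactive ssh shell on the given host finds orted.
check_remote_orted() {
    local host=$1 path
    # 'command -v' prints the resolved path and fails if orted is not on PATH.
    if path=$(ssh "$host" 'command -v orted' 2>/dev/null) && [ -n "$path" ]; then
        echo "$host: orted found at $path"
    else
        echo "$host: orted not on the non-interactive PATH"
        return 1
    fi
}

# Usage (hostname taken from the question):
#   check_remote_orted slave
```

If this reports that orted is missing while an interactive login on the same host finds it, the startup files are the likely place to look.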

Also note that it is considerably simpler to have Open MPI installed in the same path location on all nodes -- in that way, the prefix method described above will automatically add the right PATH and LD_LIBRARY_PATH when executing on the remote nodes, and you don't have to muck with your shell startup files.

Note that there are a bunch of FAQ items about these kinds of topics on the main Open MPI web site.

Jeff Squyres
  • This is what my .bashrc file looks like on master: http://pastebin.com/JTCZzpWs This is what it looks like on slave: http://pastebin.com/TDSZiFUz I don't see anything wrong here. Do you? I'm using Ubuntu. – clarity Apr 09 '14 at 23:43
  • Are you 100% sure that your $HOME/.bashrc is being executed when you run non-interactive ssh commands? E.g., "ssh master uptime" and "ssh slave uptime"? You might want to put echo statements in your $HOME/.bashrc's to verify. – Jeff Squyres Apr 10 '14 at 10:00
  • Yeah .bashrc seems to be executed because I echo "Welcome" at the top of the file now and it says "Welcome bash: orted: command not found" now though. – clarity Apr 10 '14 at 20:19
  • However, when I put echo "Welcome" at the bottom of the .bashrc file, it doesn't output "Welcome". hmmm... – clarity Apr 10 '14 at 20:21
  • I just moved all of my "export"s to the top of my .bashrc files. Now there is no 'orted' issue. Now it is just not outputting anything at all. – clarity Apr 10 '14 at 20:29
  • It might be a firewall issue according to this post: http://stackoverflow.com/questions/6632847/openmpi-1-4-3-mpirun-hostfile-error. Any other ideas? Thanks a lot for the help by the way, I appreciate it. – clarity Apr 10 '14 at 20:36
  • If moving the export statements to the top of your .bashrc works, it means that there's more in your .bashrc than you put in the pastebin outputs. There may be a difference between interactive and non-interactive logins in your .bashrc -- that's the likely culprit (and why moving them up works). As for no output, yes, firewalls can be an issue, see: http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems – Jeff Squyres Apr 11 '14 at 00:59
  • I finally got it to work. I found out that I installed it incorrectly. When I install it incorrectly and execute mpiexec (without any options), some short output pops up saying to use -np option. I was building from source using ./configure --prefix=$DIR. I noticed that when I build/install openmpi when $DIR does not already exist, I'm stuck in this situation I've been in. However, when I use --prefix=$DIR to a $DIR that actually exists, everything works fine. When it works fine, I run mpiexec (no options) and I get the help menu. – clarity Apr 11 '14 at 06:55
  • FWIW, ./configure should perform identically regardless of whether your --prefix $DIR exists already or not (i.e., "make install" will create $DIR if it doesn't already exist). I think you have found a symptom, but not the real cause. But if it works, that may be sufficient. – Jeff Squyres Apr 11 '14 at 12:11
  • I got it to fully work now. I was having problems still with the remote machine not picking up where orted is located. I copied all of the openmpi/bin/* files to /bin/ and now it can find orted (finds it in /bin/ instead). – clarity Aug 04 '14 at 18:04
1

Either explicitly set the absolute OpenMPI prefix with the --prefix option:

prompt> mpiexec --prefix=$HOME/mpi ...

or invoke mpiexec with the absolute path to it:

prompt> $HOME/mpi/bin/mpiexec ...

The latter option sets the prefix automatically. The prefix is then used to set PATH and LD_LIBRARY_PATH on the remote machines.
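As a rough illustration of that last sentence, the sketch below prints the kind of environment the prefix is meant to produce on a remote node: the prefix's bin directory prepended to PATH and its lib directory prepended to LD_LIBRARY_PATH. `prefix_env` is a hypothetical name used only for this demonstration; it is not how Open MPI implements the feature internally.

```shell
# Illustration only: show the PATH and LD_LIBRARY_PATH that result from
# prepending <prefix>/bin and <prefix>/lib to the current values.
prefix_env() {
    local prefix=$1
    echo "PATH=$prefix/bin:$PATH"
    # Only add a ':' separator when LD_LIBRARY_PATH is already non-empty.
    echo "LD_LIBRARY_PATH=$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Install location from the question.
prefix_env "$HOME/mpi"
```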

Hristo Iliev
  • I tried both methods. It works fine for executing on the same machine, but when I try executing on another machine, it waits about 5 seconds then it stops without any output. I'm running the "hostname" program so I'm expecting the hostname of the remote machine there, but it doesn't appear. It seems like it's logging into the other machine because it does wait 5 seconds before terminating. – clarity Apr 08 '14 at 20:32
0

This answer comes very late, but for Linux users: it is a bad habit to add environment variables at the end of the ~/.bashrc file. If you look carefully near the top of the file, you will notice a check that returns early when the shell is non-interactive, which is exactly the situation when mpiexec starts a process on the remote host through ssh. So put your environment variables at the TOP of the file, before that early return.
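The effect can be reproduced locally without ssh. This minimal sketch assumes an Ubuntu-style guard written as a `case $-` check (older files use an equivalent if test), and uses a made-up variable, `DEMO_MPI_BIN`, plus two throwaway files in place of a real .bashrc. `bash -c` is non-interactive, like the shell that ssh starts for a remote command.

```shell
#!/usr/bin/env bash
# Build two throwaway rc files that differ only in where the export sits.
tmp=$(mktemp -d)

cat > "$tmp/exports_last.rc" <<'EOF'
case $- in
    *i*) ;;
      *) return;;   # non-interactive shell: stop reading here
esac
export DEMO_MPI_BIN="$HOME/mpi/bin"
EOF

cat > "$tmp/exports_first.rc" <<'EOF'
export DEMO_MPI_BIN="$HOME/mpi/bin"
case $- in
    *i*) ;;
      *) return;;
esac
EOF

# Source an rc file in a non-interactive bash and print the variable.
check() { bash -c "source '$1'; echo \${DEMO_MPI_BIN:-unset}"; }

last=$(check "$tmp/exports_last.rc")
first=$(check "$tmp/exports_first.rc")
echo "export after the guard:  $last"
echo "export before the guard: $first"

rm -rf "$tmp"
```

The export placed after the guard is never reached, so the variable comes back as unset; the export placed before it survives.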

Joachim
-1

Try editing the file

/etc/environment

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/home/hadoop/openmpi_install/bin"
LD_LIBRARY_PATH=/home/hadoop/openmpi_install/lib

Community