I am having difficulty running an MPI program across two machines. The OS is Ubuntu 12.04 and the MPI implementation is MPICH2.

ssh is working fine:

  root@ubuntu:/home# ssh 192.168.1.9
root@gpuguy's password: 
Welcome to Ubuntu 12.04.3 LTS (GNU/Linux 3.8.0-29-generic i686)

 * Documentation:  https://help.ubuntu.com/

131 packages can be updated.
67 updates are security updates.

Last login: Thu Oct 24 17:36:25 2013 from ubuntu.local
root@gpuguy:~# 

But when I run my MPI program, it fails:

root@ubuntu:/home# mpiexec -f hosts.cfg -n 4 hello
root@192.168.1.9's password:
[proxy:0:0@gpuguy] HYDU_sock_connect (./utils/sock/sock.c:171): unable to get host address for ubuntu (1)
[proxy:0:0@gpuguy] main (./pm/pmiserv/pmip.c:209): unable to connect to server ubuntu at port 42104 (check for firewalls!)

I have already disabled the firewall on both machines, which is why SSH works. How can I solve this issue?

My MPI code runs successfully on a single machine.

rene
gpuguy

2 Answers

For MPICH (or any MPI implementation) to launch processes on remote machines, you need to have passwordless SSH set up between them. I should also mention that you really shouldn't have to be logged in as root to make this work. It's generally a very bad idea to be logged in as root all of the time.
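
A minimal sketch of one common way to set up passwordless SSH with OpenSSH, run on the machine that launches mpiexec. The function name `setup_passwordless_ssh` and the key path `~/.ssh/id_rsa` are illustrative assumptions, not anything MPICH provides; adjust them to your setup.

```shell
# Hypothetical helper: enable key-based login from this machine to one remote node.
# Usage: setup_passwordless_ssh user@host
setup_passwordless_ssh() {
    target="$1"
    if [ -z "$target" ]; then
        echo "usage: setup_passwordless_ssh user@host" >&2
        return 2
    fi
    key="$HOME/.ssh/id_rsa"   # assumed key location
    mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh" || return 1
    # Generate a key pair with an empty passphrase only if none exists yet.
    [ -f "$key" ] || ssh-keygen -t rsa -N "" -f "$key" || return 1
    # Append the public key to the remote authorized_keys (asks for the password once).
    ssh-copy-id -i "$key.pub" "$target" || return 1
    # BatchMode makes ssh fail instead of prompting, so this verifies the setup.
    ssh -o BatchMode=yes "$target" true
}
```

For the machines in the question this would be called as `setup_passwordless_ssh user@192.168.1.9`, where `user` is a placeholder for a non-root account that exists on both machines. If the final `ssh` step succeeds without a prompt, mpiexec should no longer ask for a password.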

Wesley Bland
  • I have set up passwordless SSH, but when I run the mpirun command I get the error message "[proxy:0:0@gauss-mic0] HYDU_sock_connect (./utils/sock/sock.c:264): unable to connect from "gauss-mic0" to "127.0.1.1" (Connection refused) [proxy:0:0@gauss-mic0] main (./pm/pmiserv/pmip.c:396): unable to connect to server 127.0.1.1 at port 42947 (check for firewalls!)" – debonair Jan 31 '14 at 19:03
  • If you have another question, you'll need to post it separately rather than trying to do everything through the comments. – Wesley Bland Jan 31 '14 at 20:01

In the /etc/hosts file on every server, add the IP address and hostname of each server taking part in the run.

For example:

10.10.0.5    server1
10.10.0.6    server2
10.10.0.7    server3

Also check that the IP address and hostname in /etc/hosts are separated by real whitespace (spaces or an actual tab character), not by a literal \t sequence typed into the file.

This is wrong:

10.10.0.5 \t server1

This is correct:

10.10.0.5    server1

Be careful not to delete or modify existing lines in the /etc/hosts file; only add new lines at the end of the file.

Also, you do not need to disable the firewall to fix this issue.
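
To check the /etc/hosts entries described above, a small shell helper along these lines may be useful. It is only a sketch: `has_hosts_entry` is a hypothetical name, and it inspects the given file directly rather than asking the system resolver, so it says nothing about DNS.

```shell
# Hypothetical helper: succeed if NAME is listed as a hostname or alias
# in a hosts-format FILE (for example /etc/hosts).
# Usage: has_hosts_entry FILE NAME
has_hosts_entry() {
    awk -v name="$2" '
        { sub(/#.*/, "") }    # drop comments; awk then re-splits the fields
        {
            # Field 1 is the IP address; every later field is a name or alias.
            for (i = 2; i <= NF; i++)
                if ($i == name) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

The error in the question says the proxy on gpuguy could not get a host address for ubuntu, so on gpuguy one could run `has_hosts_entry /etc/hosts ubuntu || echo "ubuntu is missing from /etc/hosts"`, and do the same for every hostname on every machine.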

Mohsen