
I have successfully set up passwordless SSH between the servers and my computer. I have a simple Open MPI program that runs well on a single computer. Unfortunately, when I try it on the cluster, I get neither a password prompt (as I have set up SSH key authorization) nor any forward progress in the execution.
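
As a sanity check, non-interactive SSH to each node listed in the hostfile can be verified with something like the following (hostnames as in the hostfile below); both commands should print the remote hostname without ever asking for a password:

    ssh gautam@pcys13.grm.polymtl.ca hostname
    ssh gautam@srvgrm04 hostname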

My hostfile looks like this:

# The Hostfile for Open MPI

# The master node, 'slots=8' is used because it has 8 cores
  localhost slots=8
# The following slave nodes are single processor machines:
  gautam@pcys13.grm.polymtl.ca slots=8 
  gautam@srvgrm04 slots=160

I am running this hello world MPI program on the cluster:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  double t;

  MPI_Init(&argc, &argv);
  t = MPI_Wtime();
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(processor_name, &namelen);

  printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
  MPI_Finalize();
  return 0;
}

and I am running it like this:

    mpirun -np 16 --hostfile hostfile ./hello
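
A useful side check is to launch a plain non-MPI command through the same hostfile, for example something like:

    mpirun -np 4 --hostfile hostfile hostname

Open MPI can start arbitrary executables, so if even this hangs, the problem lies in launching the remote daemons rather than in the hello program itself.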

When I use the -d option, the log looks like this:

[gautam@pcys33:~/LTE/check ]% mpirun -np 16 --hostfile hostfile -d ./hello
[pcys33.grm.polymtl.ca:02686] procdir: /tmp/openmpi-sessions-gautam@pcys33.grm.polymtl.ca_0/60067/0/0
[pcys33.grm.polymtl.ca:02686] jobdir: /tmp/openmpi-sessions-gautam@pcys33.grm.polymtl.ca_0/60067/0
[pcys33.grm.polymtl.ca:02686] top: openmpi-sessions-gautam@pcys33.grm.polymtl.ca_0
[pcys33.grm.polymtl.ca:02686] tmp: /tmp
[srvgrm04:77812] procdir: /tmp/openmpi-sessions-gautam@srvgrm04_0/60067/0/1
[srvgrm04:77812] jobdir: /tmp/openmpi-sessions-gautam@srvgrm04_0/60067/0
[srvgrm04:77812] top: openmpi-sessions-gautam@srvgrm04_0
[srvgrm04:77812] tmp: /tmp

Can you draw any inference from these logs?

Ankur Gautam
  • Maybe try the `-d` option to `mpirun` to get some idea of what's happening. – Zulan Jul 12 '13 at 11:16
  • I edited the question to include the log from running with the `-d` option. – Ankur Gautam Jul 12 '13 at 16:30
  • Are you sure that `hello` exists on all nodes and is located in the same filesystem path? Apparently the ORTE daemon is launching successfully on the second node, although the absence of `pcys13.grm.polymtl.ca` in the log could indicate that there is a problem connecting to it (or is it an alias for `srvgrm04`?) BTW, you don't have to specify the usernames in the hostfile if they are the same as the one on the master host. – Hristo Iliev Jul 20 '13 at 09:16
  • Since every node has the same file system with the same authentication, I think hello will exist on all of them. I have passwordless SSH enabled and can access the other computers via ssh. I have also tried a hostfile that does not include the username for each node. – Ankur Gautam Jul 20 '13 at 14:11
  • Am I supposed to change anything in the code for it to run on a cluster of servers? I used 32 processes on a single server and it works well. Or is there anything that needs to be specified for load balancing between the nodes? Please help. – Ankur Gautam Jul 22 '13 at 15:23
  • I have reached some conclusions regarding the problem. Can you please have a look at them? http://stackoverflow.com/questions/17820445/openmpi-hello-world-on-cluster – Ankur Gautam Jul 23 '13 at 20:31
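
Following up on the comment about ./hello existing on every node, a quick check along these lines (the ~/LTE/check path is taken from the shell prompt in the -d log above; adjust it if the binary lives elsewhere) would be:

    for h in pcys13.grm.polymtl.ca srvgrm04; do
        ssh gautam@$h "ls -l ~/LTE/check/hello"
    done

Each ssh command should list the binary without prompting for a password; an error here would point to a missing binary or a path mismatch rather than an MPI problem.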

1 Answer


You just need to disable the firewall on each machine (or open the TCP ports that Open MPI uses). With the firewall blocking incoming connections, the remote daemons can start but the processes never manage to connect back, so the run hangs exactly as you describe.
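
As a rough sketch (the exact commands depend on which distribution each node runs, so these service names are assumptions), the firewall can be inspected and temporarily disabled like this:

    # see whether incoming TCP connections are being dropped
    sudo iptables -L -n

    # RHEL/CentOS-style nodes
    sudo service iptables stop      # or: sudo systemctl stop firewalld

    # Ubuntu-style nodes
    sudo ufw disable

Once the run works, it is better to re-enable the firewall and open only the TCP ports that Open MPI actually needs.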