I'm having a weird problem that I have no idea how to solve, and I would appreciate some help.

I'm running Windows 7 on multiple locally connected machines, which have MPICH (version 1.4.1p1) installed. I have checked that the standard cpi.exe example works on each machine. However, when testing it on multiple machines, I find a weird problem. Suppose I have three machines: localhost, HOST1, HOST2.

If I execute the following commands (from localhost):

mpiexec -n 2 -host HOST1 .\cpi.exe

mpiexec -n 2 -host HOST2 .\cpi.exe

mpiexec -n 2 -host HOST1 .\cpi.exe : -n 2 -host HOST2 .\cpi.exe

then they execute fine. However, if I swap the order of the hosts in the last command, i.e.

mpiexec -n 2 -host HOST2 .\cpi.exe : -n 2 -host HOST1 .\cpi.exe

then I get the following error:

Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1478)......................: MPI_Bcast(buf=0018FE48, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1321).................:
MPIR_Bcast_intra(1119)................:
MPIR_Bcast_scatter_ring_allgather(962):
MPIR_Bcast_binomial(213)..............: Failure during collective
MPIR_Bcast_scatter_ring_allgather(955):
MPIR_Bcast_binomial(189)..............:
MPIC_Send(66).........................:
MPIC_Wait(540)........................:
MPIDI_CH3I_Progress(402)..............:
MPID_nem_mpich2_blocking_recv(905)....:
MPID_nem_newtcp_module_poll(37).......:
MPID_nem_newtcp_module_connpoll(2656).:
gen_cnting_fail_handler(1739).........: connect failed - The semaphore timeout period has expired. (errno 121)

In this latter case, if I turn the firewall off on HOST2, then it works. Unfortunately I have very little experience with firewalls and networking in general so I don't know how to resolve this.

The only thing I can figure out is that it's failing on the first collective MPI call (broadcast).

Please help!

queenbee

1 Answer

OK, solved my own problem. I had added a firewall exception for the cpi.exe program on HOST1 but not on HOST2. The solution was to make sure the exception was added on BOTH machines!
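For anyone who prefers to script this rather than click through the firewall dialog, here is a minimal sketch of one way to add the exception with the built-in netsh tool. It is not necessarily how I did it. The path C:\mpi\cpi.exe and the rule name MPI_cpi are hypothetical placeholders, so substitute the real location of cpi.exe on each host. The rule has to be added on every machine, from an elevated (administrator) prompt.

```python
import platform
import subprocess

# Hypothetical absolute path to the executable (an assumption for illustration).
# Change it to wherever cpi.exe actually lives on each host.
CPI_PATH = r"C:\mpi\cpi.exe"


def firewall_rule_command(program_path, rule_name="MPI_cpi"):
    """Build the netsh argument list for an inbound allow rule for one program."""
    return [
        "netsh", "advfirewall", "firewall", "add", "rule",
        f"name={rule_name}",        # hypothetical rule name, any label works
        "dir=in",                   # the rule applies to inbound connections
        "action=allow",
        f"program={program_path}",
        "enable=yes",
    ]


def add_firewall_rule(program_path):
    """Run the command on Windows; on any other system just print it."""
    cmd = firewall_rule_command(program_path)
    if platform.system() == "Windows":
        # Raises CalledProcessError if netsh fails, e.g. when not elevated.
        subprocess.run(cmd, check=True)
    else:
        print(" ".join(cmd))
    return cmd


if __name__ == "__main__":
    add_firewall_rule(CPI_PATH)
```

Run it once on HOST1 and once on HOST2, then retry the mpiexec command with the hosts in both orders.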

queenbee