I'm running into a strange problem that I have no idea how to solve, and I would appreciate some help.
I'm running Windows 7 on multiple locally connected machines, which have MPICH (version 1.4.1p1) installed. I have checked that the standard cpi.exe example works on each machine individually. However, when I run it across multiple machines, it fails for one particular ordering of the hosts. Suppose I have three machines: localhost, HOST1, HOST2.
If I execute the following commands (from localhost):
mpiexec -n 2 -host HOST1 .\cpi.exe
mpiexec -n 2 -host HOST2 .\cpi.exe
mpiexec -n 2 -host HOST1 .\cpi.exe : -n 2 -host HOST2 .\cpi.exe
then they all execute fine. However, if I swap the order of the hosts in the last command, i.e.
mpiexec -n 2 -host HOST2 .\cpi.exe : -n 2 -host HOST1 .\cpi.exe
then I get the following error:
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1478)......................: MPI_Bcast(buf=0018FE48, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1321).................:
MPIR_Bcast_intra(1119)................:
MPIR_Bcast_scatter_ring_allgather(962):
MPIR_Bcast_binomial(213)..............: Failure during collective
MPIR_Bcast_scatter_ring_allgather(955):
MPIR_Bcast_binomial(189)..............:
MPIC_Send(66).........................:
MPIC_Wait(540)........................:
MPIDI_CH3I_Progress(402)..............:
MPID_nem_mpich2_blocking_recv(905)....:
MPID_nem_newtcp_module_poll(37).......:
MPID_nem_newtcp_module_connpoll(2656).:
gen_cnting_fail_handler(1739).........: connect failed - The semaphore timeout period has expired. (errno 121)
In this latter case, if I turn the firewall off on HOST2, it works. Unfortunately, I have very little experience with firewalls and networking in general, so I don't know how to resolve this without disabling the firewall entirely.
The only thing I can figure out is that it's failing on the first collective MPI call (broadcast).
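In case it helps with diagnosis, here is a minimal sketch of a TCP connectivity check I could run between the machines. It assumes the failure is an inbound TCP connection to HOST2 being blocked by its firewall, which is only my guess based on the "connect failed" line in the error stack. The function name can_connect and the port number in the usage note are placeholders of my own, not anything MPICH defines, and I don't know which ports MPICH actually uses.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, with something listening on a chosen port on HOST2 (50000 here is an arbitrary placeholder), calling can_connect("HOST2", 50000) from HOST1 with the firewall on and then off would show whether the firewall is what blocks the connection.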
Please help!