(First of all, I want to thank Hristo Iliev. He helps me a lot in my current MPI project.)
The problem is that MPI_Irecv sometimes hangs (in roughly half of the runs). My program is more than 20,000 lines, so I cannot post it here.
The code where it hangs looks like this:
MPI id:0
    MPI_Ssend(id=1,tag=1);
    MPI_Ssend(id=1,tag=x);
    MPI_Recv(id=1,tag=x+1);
MPI id:1
    MPI_Recv(id=0,tag=1);
    pthread_create(fun_A());
    void fun_A()
    {
        MPI_Recv(id=0,tag=x);
        MPI_Ssend(id=0,tag=x+1);
    }
To debug it, I added flags after each MPI function call. These flags are printf statements plus markers written to a file.
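To make concrete what I mean by a flag: the sketch below is not the code from my program, only a minimal illustration of the idea. The names format_trace and trace_flag are made up for this post. It flushes after every line, on the assumption that buffered output could otherwise be lost when a process hangs or is killed, which may be one reason flags go missing.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats "[rank R] <message>" into buf.
   Returns the number of characters written (without the NUL),
   or -1 if the line does not fit or formatting fails. */
int format_trace(char *buf, size_t size, int rank, const char *fmt, ...)
{
    assert(buf != NULL && fmt != NULL);

    int n = snprintf(buf, size, "[rank %d] ", rank);
    if (n < 0 || (size_t)n >= size)
        return -1;

    size_t left = size - (size_t)n;
    va_list ap;
    va_start(ap, fmt);
    int m = vsnprintf(buf + n, left, fmt, ap);
    va_end(ap);
    if (m < 0 || (size_t)m >= left)
        return -1;

    return n + m;
}

/* Hypothetical helper: writes one flag line to out and flushes at once,
   so the line reaches the file even if the process hangs right after.
   Returns 0 on success, -1 on failure. */
int trace_flag(FILE *out, int rank, const char *what)
{
    char line[256];

    if (format_trace(line, sizeof line, rank, "%s", what) < 0)
        return -1;
    if (fprintf(out, "%s\n", line) < 0)
        return -1;
    return fflush(out) == 0 ? 0 : -1;
}
```

A call such as trace_flag(stderr, rank, "after MPI_Recv tag=x") would go directly after each MPI call, with rank being whatever MPI_Comm_rank returned. Writing to one file per rank would avoid interleaved lines from the two machines.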
Several points about my program are listed below.
1. (important) When I run my MPI program on 1 machine using 2 cores, it works. But when I run it on 2 machines (each using 1 core), sometimes, in MPI id:1, MPI_Ssend(id=0,tag=x+1) (and MPI_Wait()) returns, but MPI id:0 hangs at MPI_Recv(id=1,tag=x+1).
2. (important) When MPI_Recv(id=1,tag=x+1) in MPI id:0 hangs, the first 2 MPI functions in MPI id:1 should have finished. But sometimes there are no flags from MPI id:1 at all, and sometimes there are flags from all 3 MPI functions of MPI id:1.
3. (important) When MPI id:0 hangs at MPI_Recv(id=1,tag=x+1), there is no sender thread.
4. vfork is used in my program to fork other jobs. MPI functions are not used in these jobs. These jobs use a message queue to communicate with a thread in MPI_COMM_WORLD.
5. I enabled multiple-thread support at configure time. MPI is initialized with MPI_Init_thread (requesting multiple-thread support), and its return value is checked.
I do not know what is going on in my program. My guesses:
There is a bug in Open MPI.
There is an error in the configuration.
There is a bug in my program. (But if so, why does it work when I run it on 1 machine using 2 cores and fail on 2 machines each using 1 core?)
Could anyone give me any hints?
The output of ifconfig -a is below. One node's IP is 10.1.1.112 and the other's is 10.1.1.113. The output on the two nodes is exactly the same except for the IP address.
eth0 Link encap:Ethernet HWaddr 00:21:5E:2F:62:8A
inet addr:10.1.1.113 Bcast:10.1.1.255 Mask:255.255.255.0
inet6 addr: 2001:da8:203:eb1:221:5eff:fe2f:628a/64 Scope:Global
inet6 addr: fe80::221:5eff:fe2f:628a/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3402577 errors:0 dropped:0 overruns:0 frame:0
TX packets:208064 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:291778729 (278.2 MiB) TX bytes:25343147 (24.1 MiB)
eth1 Link encap:Ethernet HWaddr 00:21:5E:2F:62:8C
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:1770 errors:0 dropped:0 overruns:0 frame:0
TX packets:1770 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:798595 (779.8 KiB) TX bytes:798595 (779.8 KiB)