I am running an HPC benchmark (IOR - http://sourceforge.net/projects/ior-sio/) on Lustre. I compiled IOR from source and am running it with OpenMPI 1.5.3.
The problem is that it hangs when the number of processes (-np) is between 2 and 5, which is odd. Stripping away everything else I do around it, the actual command I run comes down to this:
/usr/lib64/openmpi/bin/mpirun --machinefile mpi_hosts --bynode -np 16 /path/to/IOR -F -u -t 1m -b 16g -i 1 -o /my/file/system/out_file
Attaching GDB to the hung process shows that it is stuck in MPI_Recv:
#0 0x00007f3ac49e95fe in mlx4_poll_cq () from /usr/lib64/libmlx4-m-rdmav2.so
#1 0x00007f3ac6ce0918 in ?? () from /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so
#2 0x000000385a6f0d5a in opal_progress () from /usr/lib64/openmpi/lib/libmpi.so.1
#3 0x00007f3ac7511e05 in ?? () from /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so
#4 0x000000385a666cac in PMPI_Recv () from /usr/lib64/openmpi/lib/libmpi.so.1
#5 0x0000000000404bd7 in CountTasksPerNode (numTasks=16, comm=0x628a80) at IOR.c:526
#6 0x0000000000407d53 in SetupTests (argc=11, argv=0x7fffe61fa2a8) at IOR.c:1402
#7 0x0000000000402e96 in main (argc=11, argv=0x7fffe61fa2a8) at IOR.c:134
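(The backtrace above was captured by attaching to one of the hung ranks, roughly gdb -p <pid> followed by bt.)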
The problem occurs only when -np is 2, 3, 4, or 5; it works fine for 1, 6, 7, 8, 16, etc.
I can't reproduce the problem with simple commands such as date or ls, so I am not sure whether this is a problem with my environment or with the IOR binary I compiled (the latter seems very unlikely, because the same thing happens with an older, stable IOR binary too).
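For example, the equivalent run with a trivial command returns normally:
/usr/lib64/openmpi/bin/mpirun --machinefile mpi_hosts --bynode -np 4 date
(though date and ls never call into MPI, so this mostly validates the launch environment rather than the communication path).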
The exact same behaviour is observed when using OpenMPI 1.4.3 instead of 1.5.3.
I have also tried varying the number of hosts (the --machinefile argument), and the same behaviour is observed for the above-mentioned -np values.
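For reference, mpi_hosts is just a plain list of hostnames, one per line, along these lines (actual names omitted):
node01
node02
node03
node04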
The source line where it hangs is this:
MPI_Recv(hostname, MAX_STR, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status);
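In case it helps, the communication pattern around that line reduces to a small self-contained program. This is my own sketch of the same send/receive shape, not IOR's actual code (MAX_STR here is just 256, standing in for IOR's buffer size): every non-zero rank sends its hostname to rank 0, which collects them with the same wildcard receive. Running it at -np 2 through 5 would show whether the hang is in IOR itself or lower down in the MPI/InfiniBand stack:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_STR 256

int main(int argc, char **argv)
{
    int rank, numTasks, i;
    char hostname[MAX_STR];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
    gethostname(hostname, MAX_STR);

    if (rank == 0) {
        /* rank 0 collects one hostname from every other rank,
           using the same wildcard receive as IOR.c:526 */
        for (i = 1; i < numTasks; i++) {
            MPI_Recv(hostname, MAX_STR, MPI_CHAR, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            printf("got %s from rank %d\n", hostname, status.MPI_SOURCE);
        }
    } else {
        /* every other rank sends its hostname to rank 0 */
        MPI_Send(hostname, MAX_STR, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

(Compiled with mpicc and launched with the same mpirun line as above, minus the IOR arguments.)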
Basically, I am looking for clues as to why MPI_Recv() might hang when -np is 2, 3, 4, or 5. Please let me know if any other information is needed. Thanks.