I am trying to optimize the performance of an MPI+CUDA benchmark, LAMMPS (https://github.com/lammps/lammps). Right now I am running with two MPI processes and two GPUs. My system has two sockets, and each socket connects to two K80s. Since each K80 contains two GPUs internally, each socket is actually connected to four GPUs, but I am only using two cores on one socket and the two GPUs (one K80) attached to that socket. The MPI library is MVAPICH2 2.2rc1 and the CUDA toolkit version is 7.5.
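For context, this is roughly how I picture the rank-to-GPU mapping in my runs. It is only a minimal sketch, not what LAMMPS does internally; MV2_COMM_WORLD_LOCAL_RANK is the node-local rank variable MVAPICH2 sets, and I am assuming that device IDs 0 and 1 belong to the same K80:

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* MVAPICH2 exports the node-local rank; fall back to the global
     * rank if the variable is not set. */
    const char *lrank_env = getenv("MV2_COMM_WORLD_LOCAL_RANK");
    int local_rank = lrank_env ? atoi(lrank_env) : rank;

    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    /* With two ranks on one socket this maps rank 0 -> GPU 0 and
     * rank 1 -> GPU 1 (assuming those two IDs are the one K80). */
    cudaSetDevice(local_rank % ngpus);

    int dev = -1;
    cudaGetDevice(&dev);
    printf("rank %d uses GPU %d of %d\n", rank, dev, ngpus);

    MPI_Finalize();
    return 0;
}
```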
That was the background. I profiled the application and found that communication was the performance bottleneck, and I suspect that is because no GPUDirect technique was being used. So I switched to MVAPICH2-GDR 2.2rc1 and installed all the other required libraries and tools. However, MVAPICH2-GDR requires an InfiniBand interface card, which is not available on my system, so I get the runtime error "channel initialization failed. No active HCAs found on the system". Based on my understanding, an InfiniBand card should not be required if we only want to use the two GPUs within one K80 on a single node, because the K80 has an internal PCIe switch connecting those two GPUs (I include a small peer-access check after the questions to show what I mean). To make the questions clear, I list them as follows:
1. In my system, one socket connects to two K80s. If the two GPUs in one K80 need to communicate with the GPUs in the other K80, must we have an IB card in order to use GPUDirect?
2. If we only need to use the two GPUs within one K80, then the communication between these two GPUs does not require an IB card, right? However, MVAPICH2-GDR requires at least one IB card. Is there any workaround for this issue, or do I have to plug an IB card into the system?
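To back up the peer-access claim above, here is the small check I have in mind; it only uses the CUDA runtime, so it is independent of the MPI library. The device IDs 0 and 1 are an assumption about how the two GPUs of one K80 are enumerated on my machine:

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    /* Assumed: devices 0 and 1 are the two GPUs of the same K80. */
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("peer access 0->1: %d, 1->0: %d\n", can01, can10);

    if (can01) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   /* flags argument must be 0 */
    }
    if (can10) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
    }
    return 0;
}
```

If both directions report 1, the two GPUs should be able to reach each other directly over the internal PCIe switch without staging through the host.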
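And for reference, this is the kind of GPU-to-GPU exchange I ultimately want the two ranks to do without an IB card: passing device pointers straight to MPI. It is only a sketch assuming a CUDA-aware MVAPICH2 build (run with MV2_USE_CUDA=1); the message size is arbitrary:

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One rank per GPU inside the same K80 (device IDs are an assumption). */
    cudaSetDevice(rank);

    const int n = 1 << 20;              /* example message size */
    double *dbuf = NULL;
    cudaMalloc((void **)&dbuf, n * sizeof(double));

    /* Device pointers passed directly to MPI, which a CUDA-aware build
     * is supposed to allow. */
    if (rank == 0)
        MPI_Send(dbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```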