I am using a GPU cluster without GPUDirect support. According to this briefing, transferring GPU data across nodes involves the following steps:
- GPU writes to pinned sysmem1
- CPU copies from sysmem1 to sysmem2
- InfiniBand driver copies from sysmem2
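For context, the pinned buffer in the first step is the kind of page-locked allocation that `cudaMallocHost` returns on the application side; here is a minimal sketch, with the buffer name and size being illustrative assumptions rather than anything from the briefing:

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    float *buf = NULL;      /* stand-in for a pinned buffer like sysmem1 */
    size_t n = 1 << 20;     /* assumed element count */

    /* Page-locked (pinned) allocation: the GPU can DMA directly into
       this buffer, unlike ordinary pageable malloc memory. */
    cudaError_t err = cudaMallocHost((void **)&buf, n * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* ... device-to-host copies would target buf here ... */

    cudaFreeHost(buf);
    return 0;
}
```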
Now I am not sure whether the second step happens implicitly when I transfer sysmem1 across InfiniBand using MPI. Assuming it does, my current programming model looks like this (a runnable sketch follows the list):
- cudaMemcpy(hostmem, devicemem, size, cudaMemcpyDeviceToHost)
- MPI_Send(hostmem, ...)
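Spelled out, and assuming a two-rank job with an illustrative buffer size (using a pinned host buffer here is my own assumption, not part of the model above), the sketch would be something like:

```c
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;                 /* assumed element count */
    float *devicemem = NULL, *hostmem = NULL;

    cudaMalloc((void **)&devicemem, n * sizeof(float));
    /* Pinned host buffer: lets cudaMemcpy avoid an internal staging
       copy that pageable memory would require on the CUDA side. */
    cudaMallocHost((void **)&hostmem, n * sizeof(float));

    if (rank == 0) {
        /* Step 1: explicit device-to-host copy into hostmem. */
        cudaMemcpy(hostmem, devicemem, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
        /* Step 2: hand the host buffer to MPI. Whether the InfiniBand
           stack stages it through a second buffer (sysmem2) implicitly
           is exactly what I am asking about. */
        MPI_Send(hostmem, (int)n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(hostmem, (int)n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        cudaMemcpy(devicemem, hostmem, n * sizeof(float),
                   cudaMemcpyHostToDevice);
    }

    cudaFreeHost(hostmem);
    cudaFree(devicemem);
    MPI_Finalize();
    return 0;
}
```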
Is my assumption above correct, and will this programming model work without causing communication issues?