
With CUDA SDK 5.5 I can use the following ways of copying data:

  • from host: cudaMemcpy(); uses GPU-DMA if the memory is pinned (a minimal sketch follows this list)
  • from host: memcpy(); or cudaMemcpy(); uses the CPU cores if the memory isn't pinned
  • from GPU: for() { dst[i] = src[i]; } or memcpy(); uses the GPU cores
  • from GPU: how do I use GPU-DMA?
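
For example, the pinned-memory case looks like this (a minimal sketch; names, sizes, and error handling are illustrative only):

```
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;
    float *h_pinned, *d_buf;

    // Pinned (page-locked) host memory lets cudaMemcpy() use the GPU's DMA engine
    cudaMallocHost(&h_pinned, N * sizeof(float));
    cudaMalloc(&d_buf, N * sizeof(float));

    // Host-to-device copy via GPU-DMA; with pinned memory this could also be
    // overlapped with kernel execution using cudaMemcpyAsync() on a stream
    cudaMemcpy(d_buf, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```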

How can I use GPU-DMA in a kernel function of CUDA code to copy data?

Alex
  • What exactly is your problem? Accessing the memory space of one GPU from another GPU? In that case you can use Peer-to-Peer (P2P) GPU memory copies; have a look at the simpleP2P CUDA SDK example. – Vitality Oct 08 '13 at 08:36
  • @JackOLantern The problem is that for Peer-to-Peer (P2P) GPU memory copies I need to use `cudaMemcpy();`, but I can't use it **in the kernel function** of CUDA code, as said here: http://on-demand.gputechconf.com/gtc-express/2011/presentations/cuda_webinars_GPUDirect_uva.pdf I want to initiate an async copy in advance (as a prefetch) from remote GPU-RAM to the current GPU-RAM, to reduce latency when I later access the current GPU-RAM instead of the remote GPU-RAM. – Alex Oct 08 '13 at 08:50
  • If your system supports UVA, then you can address one GPU's memory space from another GPU inside a kernel function, like `dst[i] = src[i];`. – Vitality Oct 08 '13 at 12:57
  • @JackOLantern Yes, that is the third point in my question, but the bandwidth is much lower than access to the current GPU's global memory. – Alex Oct 08 '13 at 13:09

1 Answer


What you are trying to do is not possible from the device side if your GPU does not support compute capability 3.5. If you have such a card, see the edit below.

Yes, you can access the RAM of another GPU by passing a device pointer allocated on that device to your kernel. The runtime will then serve the requested data from the right GPU. However, this isn't very efficient, because every access to another device's memory results in a memory-copy operation, either peer-to-peer or device-host-device.
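
For illustration, a minimal sketch of this direct-access pattern, assuming two P2P-capable GPUs under UVA (names are illustrative; error checking omitted):

```
#include <cuda_runtime.h>

// dst is local to the launching GPU; src may live on another GPU.
// Every remote read is serviced over PCIe, which is why this is slow.
__global__ void copyRemote(float *dst, const float *src, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

int main() {
    const size_t N = 1 << 20;
    float *d0, *d1;

    cudaSetDevice(0);
    // Required so kernels on GPU 0 can dereference GPU 1's pointers
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&d0, N * sizeof(float));

    cudaSetDevice(1);
    cudaMalloc(&d1, N * sizeof(float));

    cudaSetDevice(0);
    copyRemote<<<(N + 255) / 256, 256>>>(d0, d1, N);
    cudaDeviceSynchronize();
    return 0;
}
```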

What you can do is prefetch the data from within your host code and use separate streams for your memory copies (`cudaMemcpyAsync`) and your kernel executions. However, this only works if your card has at least one separate copy engine, and you have to synchronize explicitly, because there is no built-in mechanism that will hold back your kernel until the data transfer has finished.
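
One way to implement such a prefetch is sketched below: the transfer runs on its own stream via `cudaMemcpyPeerAsync`, and an event provides the explicit synchronization mentioned above, so the kernel waits for the data without blocking the host (the kernel and all names are placeholders):

```
#include <cuda_runtime.h>

__global__ void compute(float *data, size_t n) { /* placeholder kernel */ }

void prefetchAndRun(float *d_local, const float *d_remote,
                    size_t n, int localDev, int remoteDev) {
    cudaStream_t copyStream, computeStream;
    cudaEvent_t copyDone;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);
    cudaEventCreate(&copyDone);

    // Asynchronous peer-to-peer copy on its own stream (uses the copy engine)
    cudaMemcpyPeerAsync(d_local, localDev, d_remote, remoteDev,
                        n * sizeof(float), copyStream);
    cudaEventRecord(copyDone, copyStream);

    // The kernel stream waits for the transfer; the host thread does not block
    cudaStreamWaitEvent(computeStream, copyDone, 0);
    compute<<<(n + 255) / 256, 256, 0, computeStream>>>(d_local, n);

    cudaStreamSynchronize(computeStream);
    cudaEventDestroy(copyDone);
    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
}
```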

EDIT:

If you have a compute capability 3.5 device, you can use the CUDA device runtime to perform device-to-device memcpy from within your device code. See the dynamic parallelism documentation here: http://docs.nvidia.com/cuda/pdf/cuda_dynamic_parallelism_programming_guide.pdf Note that all memcpy operations on the device are asynchronous as well, so you will again have to preserve data coherence on your own.
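
A minimal sketch of that device-side path, assuming a CC 3.5 card and compilation with `-arch=sm_35 -rdc=true -lcudadevrt` (names are illustrative; device-side `cudaDeviceSynchronize()` is period-appropriate for CUDA 5.5 but has been removed from recent toolkits):

```
#include <cuda_runtime.h>

// One thread issues an asynchronous device-to-device copy via the
// device runtime; the result is only safe to read after synchronization.
__global__ void prefetchKernel(float *dst, const float *src, size_t n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        cudaMemcpyAsync(dst, src, n * sizeof(float),
                        cudaMemcpyDeviceToDevice);
        // Wait for the copy issued by this block before dst is used
        cudaDeviceSynchronize();
    }
    // ... a grid-wide synchronization would still be needed before
    // other blocks may safely read dst ...
}
```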

Michael Haidl
  • Thanks! Does the GPU use DMA through the function `memcpy()`, or does it simply create a separate stream in which it asynchronously does the copy `for () dst[i] = src[i];`? – Alex Oct 08 '13 at 13:06
  • 1
    The documentation is not realy clear about this. I think you have to do some profiling and look if peer to peer memcopies are invoked when you call cudaMemcpyAsync within your kernel code. Most likely peer to peer access must be enabled first via host code. – Michael Haidl Oct 08 '13 at 15:47