Better or the same: CPU memcpy() vs device cudaMemcpy() on pinned, mapped memory in CUDA?

Question

I have:

Host memory that has been successfully pinned and mapped using cudaHostAlloc(..., cudaHostAllocMapped) or cudaHostRegister(..., cudaHostRegisterMapped);
Device pointers have been obtained using cudaHostGetDevicePointer(...).

I initiate cudaMemcpy(..., cudaMemcpyDeviceToDevice) on src and dest device pointers that point to two different regions of pinned+mapped memory obtained by the technique above. Everything works fine.
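For reference, the setup described above might look roughly like the following. This is a minimal sketch, not the asker's actual code: the buffer size, the variable names, and the CHECK macro are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = 1 << 20;  // assumed size: 1 MiB

    // Older runtimes need this flag set before mapped allocations are made.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Two separate regions of pinned, mapped host memory.
    unsigned char *h_src = nullptr, *h_dst = nullptr;
    CHECK(cudaHostAlloc((void **)&h_src, bytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void **)&h_dst, bytes, cudaHostAllocMapped));
    memset(h_src, 0xAB, bytes);
    memset(h_dst, 0x00, bytes);

    // Device-side aliases of the same host memory.
    unsigned char *d_src = nullptr, *d_dst = nullptr;
    CHECK(cudaHostGetDevicePointer((void **)&d_src, h_src, 0));
    CHECK(cudaHostGetDevicePointer((void **)&d_dst, h_dst, 0));

    // The copy in question: "device to device" on mapped host memory.
    CHECK(cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice));
    // Device-to-device copies may return before completion, so synchronize
    // before reading the result on the host.
    CHECK(cudaDeviceSynchronize());

    printf("%s\n", memcmp(h_src, h_dst, bytes) == 0 ? "buffers match"
                                                    : "buffers differ");

    CHECK(cudaFreeHost(h_src));
    CHECK(cudaFreeHost(h_dst));
    return 0;
}
```

The CPU-side alternative being asked about would simply replace the cudaMemcpy and synchronize calls with memcpy(h_dst, h_src, bytes).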

Question: should I continue doing this or just use a traditional CPU-style memcpy() since everything is in system memory anyway? ...or are they the same (i.e. does cudaMemcpy map to a straight memcpy when both src and dest are pinned)?

(I am still using the cudaMemcpy method because previously everything was in device global memory, but have since switched to pinned memory due to gmem size constraints)

It's an interesting question. Provided you use an optimized memcpy, the CPU is probably better - the memory belongs to it, after all - and a discrete GPU's ability to do host->host memcpy is limited to PCIe bandwidth. But if the GPU would be idle otherwise, why not? — ArchaeaSoftware, Sep 18 '12 at 05:12
I hope the GPU wouldn't be doing the copy. I hope the runtime would see that the pointers are both host pointers and invoke a host memcpy. I have asked to find out what actually happens. — harrism, Sep 18 '12 at 06:41

score 3 · Accepted Answer · answered Sep 18 '12 at 10:44

With cudaMemcpy the CUDA driver detects that you are copying from a host pointer to a host pointer and the copy is done on the CPU. You can of course use memcpy on the CPU yourself if you prefer.

If you use cudaMemcpy, there may be an extra stream synchronize performed before doing the copy (which you may see in the profiler, but I'm guessing there—test and see).
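One way to "test and see" is to time both paths on the same buffers. The sketch below is only an illustration under assumptions: the 64 MiB buffer size, the repetition count, the time_ms helper, and the CHECK macro are all made up for this example, and the numbers will depend on the CPU, the memcpy implementation, and the driver version.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: average wall-clock time of f() in milliseconds.
template <typename F>
static double time_ms(F f, int reps) {
    f();  // warm-up run, not timed
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / reps;
}

int main() {
    const size_t bytes = 64u << 20;  // assumed size: 64 MiB
    const int reps = 20;             // assumed repetition count

    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    char *h_src = nullptr, *h_dst = nullptr;
    CHECK(cudaHostAlloc((void **)&h_src, bytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void **)&h_dst, bytes, cudaHostAllocMapped));
    memset(h_src, 1, bytes);

    char *d_src = nullptr, *d_dst = nullptr;
    CHECK(cudaHostGetDevicePointer((void **)&d_src, h_src, 0));
    CHECK(cudaHostGetDevicePointer((void **)&d_dst, h_dst, 0));

    // Path 1: plain CPU memcpy on the host pointers.
    double cpu_ms = time_ms([&] { memcpy(h_dst, h_src, bytes); }, reps);

    // Path 2: cudaMemcpy on the device aliases, synchronized so the
    // measurement covers the whole copy.
    double cuda_ms = time_ms([&] {
        CHECK(cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice));
        CHECK(cudaDeviceSynchronize());
    }, reps);

    printf("memcpy:     %.3f ms per copy\n", cpu_ms);
    printf("cudaMemcpy: %.3f ms per copy\n", cuda_ms);
    printf("first byte of destination: %d\n", (int)h_dst[0]);

    CHECK(cudaFreeHost(h_src));
    CHECK(cudaFreeHost(h_dst));
    return 0;
}
```

Running the same program under a profiler should also show whether an extra synchronization appears around the cudaMemcpy call on a given driver.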

On a UVA system you can just use cudaMemcpyDefault, as talonmies says in his answer. But if you don’t have UVA (which requires sm_20+ and a 64-bit OS), then you have to call the right copy (e.g. cudaMemcpyDeviceToDevice). If you cudaHostRegister() everything you are interested in, then cudaMemcpyDeviceToDevice will end up doing the following, depending on where the memory is located:

Host <-> Host: performed by the CPU (memcpy)
Host <-> Device: DMA (device copy engine)
Device <-> Device: Memcpy CUDA kernel (runs on the SMs, launched by driver)
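The driver's internal dispatch is not something an application can observe directly, but an application can ask the runtime how it classifies a given pointer. The following is a rough sketch, assuming CUDA 10 or later (where cudaPointerAttributes has a type field); the kind helper and the CHECK macro are hypothetical names introduced here.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: describe how the runtime classifies a pointer.
static const char *kind(const void *p) {
    cudaPointerAttributes attr;
    CHECK(cudaPointerGetAttributes(&attr, p));
    switch (attr.type) {
        case cudaMemoryTypeHost:    return "host (pinned/registered)";
        case cudaMemoryTypeDevice:  return "device";
        case cudaMemoryTypeManaged: return "managed";
        default:                    return "unregistered";
    }
}

int main() {
    const size_t bytes = 1 << 20;  // assumed size: 1 MiB

    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    void *h_pinned = nullptr;  // pinned, mapped host memory
    void *d_global = nullptr;  // ordinary device global memory
    CHECK(cudaHostAlloc(&h_pinned, bytes, cudaHostAllocMapped));
    CHECK(cudaMalloc(&d_global, bytes));

    printf("cudaHostAlloc pointer: %s\n", kind(h_pinned));
    printf("cudaMalloc pointer:    %s\n", kind(d_global));

    CHECK(cudaFreeHost(h_pinned));
    CHECK(cudaFree(d_global));
    return 0;
}
```

This only shows the kind of information available about a pointer; it does not confirm which of the three paths above the driver actually takes.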

Very interesting. Do you have a source for this information? — chris-kuhr, Jun 02 '17 at 20:47
I believe I asked my NVIDIA colleagues for implementation details. — harrism, Jun 07 '17 at 03:48

score 2 · Answer 2 · answered Sep 17 '12 at 08:08

If you are working on a platform with UVA (unified virtual addressing), I would strongly suggest using cudaMemcpy with cudaMemcpyDefault. That way all of this handwringing about the fastest path becomes an internal API implementation detail you don't have to worry about.
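A sketch of what this suggestion could look like in practice: query the device for UVA support and fall back to an explicit direction when it is absent. The buffer size, variable names, and CHECK macro are assumptions for illustration only.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = 1 << 20;  // assumed size: 1 MiB

    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    int dev = 0;
    cudaDeviceProp prop;
    CHECK(cudaGetDevice(&dev));
    CHECK(cudaGetDeviceProperties(&prop, dev));

    char *h_src = nullptr, *h_dst = nullptr;
    CHECK(cudaHostAlloc((void **)&h_src, bytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void **)&h_dst, bytes, cudaHostAllocMapped));
    memset(h_src, 7, bytes);
    memset(h_dst, 0, bytes);

    if (prop.unifiedAddressing) {
        // UVA: the runtime infers the direction from the pointer values.
        CHECK(cudaMemcpy(h_dst, h_src, bytes, cudaMemcpyDefault));
    } else {
        // No UVA: use the device aliases and name the direction explicitly.
        char *d_src = nullptr, *d_dst = nullptr;
        CHECK(cudaHostGetDevicePointer((void **)&d_src, h_src, 0));
        CHECK(cudaHostGetDevicePointer((void **)&d_dst, h_dst, 0));
        CHECK(cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice));
    }
    CHECK(cudaDeviceSynchronize());

    printf("UVA: %s, copy %s\n", prop.unifiedAddressing ? "yes" : "no",
           memcmp(h_src, h_dst, bytes) == 0 ? "ok" : "mismatch");

    CHECK(cudaFreeHost(h_src));
    CHECK(cudaFreeHost(h_dst));
    return 0;
}
```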

answered by talonmies

Yes and no, I often work on a C1060, but have access to C2050/70's. So what about in regards to my pinned memory question specifically--do you know what `cudaMemcpyDefault` does behind the scenes in this case? That would answer the question pretty much. – mikepcw Sep 17 '12 at 15:47
I don't work for NVIDIA, so I haven't seen any code, but it appears to look at the source and destination pointers and act accordingly. You will get a host-side copy with a host pointer and a device-to-device copy with a device pointer. – talonmies Sep 17 '12 at 16:22
