
I am curious whether cudaMemcpy is executed on the CPU or the GPU when we copy from host to device.

In other words, is the copy a sequential process or is it done in parallel?

Let me explain why I ask this: I have an array of 5 million elements. Now, I want to copy 2 sets of 50,000 elements from different parts of the array. So I was wondering: will it be faster to first build one large array on the CPU containing all the elements I want to copy and then do just one large transfer, or should I just call cudaMemcpy twice, once for each set?

If cudaMemcpy is done in parallel, then I think the second approach will be faster, since you don't have to copy the 100,000 elements sequentially on the CPU first.
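
To make the two options concrete, here is a rough sketch of both (the sizes, offsets, and the h_src / d_dst names are illustrative placeholders, not my actual code):

    // Hypothetical sizes and offsets, just to make the two approaches concrete.
    #include <cuda_runtime.h>
    #include <cstring>

    const size_t N = 5000000, CHUNK = 50000;
    float *h_src;                          // the 5M-element host array (assumed already allocated and filled)
    size_t off1 = 100000, off2 = 3000000;  // arbitrary example offsets into h_src

    // Approach 1: concatenate on the CPU first, then do one large transfer.
    void copy_concatenated(float *d_dst)
    {
        static float h_staging[2 * CHUNK];
        std::memcpy(h_staging,         h_src + off1, CHUNK * sizeof(float));
        std::memcpy(h_staging + CHUNK, h_src + off2, CHUNK * sizeof(float));
        cudaMemcpy(d_dst, h_staging, 2 * CHUNK * sizeof(float), cudaMemcpyHostToDevice);
    }

    // Approach 2: two direct transfers, one per 50,000-element set.
    void copy_two_transfers(float *d_dst)
    {
        cudaMemcpy(d_dst,         h_src + off1, CHUNK * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_dst + CHUNK, h_src + off2, CHUNK * sizeof(float), cudaMemcpyHostToDevice);
    }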

Programmer

2 Answers


I am curious whether cudaMemcpy is executed on the CPU or the GPU when we copy from host to device.

In the case of the synchronous API call with regular pageable, user-allocated memory, the answer is that it runs on both. The driver must first copy data from the source memory to a DMA-mapped source buffer on the host, then signal to the GPU that data is waiting for transfer. The GPU then executes the transfer. The process is repeated as many times as necessary to complete the copy from source memory to the GPU.

The throughput of this process can be improved by using pinned memory, which the driver can DMA to or from directly, without intermediate copying (although pinning has a large initialization/allocation overhead which needs to be amortised as well).
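
As a minimal sketch (the buffer size and names here are only illustrative), pinned memory is allocated with cudaMallocHost (or cudaHostAlloc) instead of malloc, and then used as the source of an ordinary cudaMemcpy:

    #include <cuda_runtime.h>

    int main()
    {
        const size_t bytes = 50000 * sizeof(float);

        // Page-locked (pinned) host allocation: expensive to create, so do it
        // once and reuse it, rather than allocating per transfer.
        float *h_pinned = 0;
        cudaMallocHost((void **)&h_pinned, bytes);

        float *d_buf = 0;
        cudaMalloc((void **)&d_buf, bytes);

        // ... fill h_pinned ...

        // The driver can DMA straight from h_pinned, skipping the staging copy
        // described above.
        cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

        cudaFree(d_buf);
        cudaFreeHost(h_pinned);
        return 0;
    }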

As to the rest of the question, I suspect that two memory copies directly from the source memory would be more efficient than the alternative, but this is the sort of question that can only be conclusively answered by benchmarking.
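
If you do benchmark it, CUDA events are a convenient way to time host-to-device copies. Something along these lines would work (the copy_fn callback is just a placeholder for whichever transfer strategy you are testing, e.g. the two sketches in the question):

    #include <cuda_runtime.h>

    // Times one invocation of a transfer routine with CUDA events and returns
    // the elapsed time in milliseconds.
    float time_copy_ms(void (*copy_fn)(float *), float *d_dst)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        copy_fn(d_dst);                // e.g. the concatenated or the two-copy version
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);    // wait until the copy has actually finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }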

talonmies

I believe a transfer from host to GPU memory is a blocking call. It uses the entire bus and, as such, it doesn't really make sense (even if it were physically possible) to run multiple operations in parallel.

I doubt you'll get any performance gain from concatenating the data before transferring it. The bottleneck will likely be the transfer itself. The copies should be queued and executed sequentially with minimal overhead.

3Dave
  • There is an asynchronous version of the memory copy API in CUDA, and all current GPUs can overlap kernel execution with memory transfer. The Fermi compute cards also have a second DMA engine onboard, so they can perform two asynchronous transfers simultaneously - either device to host, or device to device in a multi-GPU system. – talonmies Jan 13 '12 at 16:55
  • @talonmies nifty! I'm aware that kernel execution and memory transfer can run at the same time, but wasn't aware of the second memory controller. Wouldn't that cut the transfer bandwidth in half, assuming that a single transfer uses all available bandwidth? – 3Dave Jan 13 '12 at 16:59
  • @DavidLively: So you are suggesting I should use 2 cudaMemcpy instead of concatenating first? – Programmer Jan 13 '12 at 17:03
  • @talonmies: You mean I should use cudaMemcpyAsync(). One more question: If I have a kernel launch after all these async calls, will the kernel launch wait until all these calls have finished? – Programmer Jan 13 '12 at 17:04
  • @DavidLively: In a single bus with a pair of bandwidth-limited host-device transfers, yes. But smaller transfers are often latency rather than bandwidth bound, so having 2 DMA engines can hide some of that latency. Also there are other scenarios where some huge wins can happen. If you have two Fermi Teslas behind a PCI-e switch (like in a C2070), they can stream to one another without consuming host PCI-e bandwidth, and if you have a PCI-e InfiniBand card, it can DMA to the GPU memory directly off the wire *while* the GPU is transferring to the host or another GPU. – talonmies Jan 13 '12 at 17:06
  • @talonmies: Please answer my comment. – Programmer Jan 13 '12 at 17:10
  • @Programmer You should really, thoroughly read the "CUDA 4.0 Programming Guide", as it answers all these questions about synchronization. The "CUDA 4.0 Best Practices Guide" gives some additional insight on how to benefit from asynchronous memory copies. To make a long story short, you're not going to get any benefit pursuing this memory stuff until you, personally and independently, take the time to understand the asynchronous memory API, contexts, streams, and some techniques from high-performance computing like pipelining, double buffering, staging, and overlapping computation and communication. – Patrick87 Jan 13 '12 at 20:34
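
On the question raised in the comments about whether a kernel launch waits for earlier cudaMemcpyAsync calls: operations issued into the same stream execute in issue order, so a kernel launched in that stream only starts after the preceding copies have completed, while work in different streams needs explicit synchronization. A minimal sketch, assuming pinned host memory and a placeholder kernel named process (the sizes and names are illustrative only):

    #include <cuda_runtime.h>

    __global__ void process(const float *in, int n) { /* placeholder kernel */ }

    int main()
    {
        const int n = 50000;
        float *h_buf = 0, *d_buf = 0;
        cudaMallocHost((void **)&h_buf, 2 * n * sizeof(float));  // async copies need pinned memory
        cudaMalloc((void **)&d_buf, 2 * n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Two async copies followed by a kernel, all issued into the same stream:
        // the stream runs them in issue order, so the kernel does not start until
        // both copies have finished.
        cudaMemcpyAsync(d_buf,     h_buf,     n * sizeof(float), cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(d_buf + n, h_buf + n, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        process<<<(2 * n + 255) / 256, 256, 0, stream>>>(d_buf, 2 * n);

        cudaStreamSynchronize(stream);  // block the host until everything in the stream is done

        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }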