How many cudaMemcpyAsync operations can be done concurrently?

Question

Consider the following case:

// thread 0 on device 0:
cudaMemcpyAsync(Dst0, Src0, ..., stream0); // stream0 is on device 0

...
// thread 1 on device 1:
cudaMemcpyAsync(Dst1, Src1, ..., stream1); // stream1 is on device 1

Can the two memcpy operations run concurrently and achieve double the host-device bandwidth (as long as the host memory bandwidth is sufficient)? If the answer is yes, is there an upper limit on such concurrency?

I plan to write some program for many (6-8) GPUs in a single compute node, so that will be quite critical for performance.

score 3 · Answer 1 · answered Nov 23 '13 at 03:30

Only one cudaMemcpy operation can actually be transferring data at any given time, per direction, across a PCIe link. Other operations can be queued up, of course, but only one can actually be using the link.

Some CUDA GPUs have one DMA (copy) engine; others have two. Those that have two can transfer data in both directions simultaneously. That is the limit of concurrency in data transfers: one per direction (assuming dual copy engines).

Assuming the data transfers are not trivially small, the PCIe link will be fully utilized (in that direction) for the duration of each transfer.

Could you comment on the last part of the question? `I plan to write some program for many (6-8) GPUs in a single compute node, so that will be quite critical for performance` I would like to know because I will be trying something similar in the near future. A link to any paper that has already analysed this would also help. - Sagar Masuti, Nov 23 '13 at 15:24
