Consider the following case:
// thread 0 on device 0:
cudaMemcpyAsync(Dst0, Src0, ..., stream0);  // stream0 is on device 0
...
// thread 1 on device 1:
cudaMemcpyAsync(Dst1, Src1, ..., stream1);  // stream1 is on device 1
Can the two memcpy operations run concurrently and achieve double the host-to-device bandwidth (as long as the host memory bandwidth is sufficient)? If the answer is yes, is there an upper limit on such concurrency?
I plan to write a program for many (6-8) GPUs in a single compute node, so this will be quite critical for performance.