I am trying to use streams to overlap host-to-device (H2D) copies with kernel execution. To do this, I created two streams with the cudaStreamNonBlocking flag. In a loop I do the following pseudocode:
// pseudocode
cudaMemcpy(dData[0], hData[0]);
streamIdx = 0;
while (1)
{
    // prepare for the next iteration: start copying the other buffer
    cudaMemcpyAsync(dData[!streamIdx], hData[!streamIdx], stream[!streamIdx]);
    // run the current iteration: wait for this buffer's copy, then launch
    cudaStreamSynchronize(stream[streamIdx]);
    kernel1<<<..., stream[streamIdx]>>>();
    kernel2<<<..., stream[streamIdx]>>>();
    streamIdx = !streamIdx;
}
The host memory is pinned.
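A minimal compilable sketch of the loop above, for reproduction purposes. The kernel bodies, buffer size, launch configuration, iteration count and the CHECK macro are assumptions made for illustration and are not taken from the actual program.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking helper (not part of the original post).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernels: the real kernel1/kernel2 are not shown in the post.
__global__ void kernel1(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

__global__ void kernel2(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;                   // assumed buffer size
    const size_t bytes = n * sizeof(float);
    const int iterations = 8;                // assumed; the post loops forever
    const int block = 256;
    const int grid = (n + block - 1) / block;

    cudaStream_t stream[2];
    float *hData[2];
    float *dData[2];
    for (int i = 0; i < 2; ++i) {
        CHECK(cudaStreamCreateWithFlags(&stream[i], cudaStreamNonBlocking));
        CHECK(cudaMallocHost((void **)&hData[i], bytes));  // pinned host memory
        CHECK(cudaMalloc((void **)&dData[i], bytes));
        for (int j = 0; j < n; ++j) hData[i][j] = 1.0f;
    }

    // The first buffer is copied with a blocking call.
    CHECK(cudaMemcpy(dData[0], hData[0], bytes, cudaMemcpyHostToDevice));

    int streamIdx = 0;
    for (int it = 0; it < iterations; ++it) {
        const int next = !streamIdx;
        // Prepare the next iteration: async H2D copy on the other stream.
        CHECK(cudaMemcpyAsync(dData[next], hData[next], bytes,
                              cudaMemcpyHostToDevice, stream[next]));
        // Run the current iteration once its stream has drained.
        CHECK(cudaStreamSynchronize(stream[streamIdx]));
        kernel1<<<grid, block, 0, stream[streamIdx]>>>(dData[streamIdx], n);
        kernel2<<<grid, block, 0, stream[streamIdx]>>>(dData[streamIdx], n);
        CHECK(cudaGetLastError());
        streamIdx = next;
    }

    CHECK(cudaDeviceSynchronize());
    for (int i = 0; i < 2; ++i) {
        CHECK(cudaFree(dData[i]));
        CHECK(cudaFreeHost(hData[i]));
        CHECK(cudaStreamDestroy(stream[i]));
    }
    printf("done\n");
    return 0;
}
```

Profiling a build of this sketch (for example with `nvcc` and a timeline profiler) should show whether the copies on one stream overlap the kernels on the other.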
The result is that every second cudaMemcpyAsync is delayed, which makes the code run slower.
See diagram:
I managed to trick the copy engine by launching a fake (empty) kernel on the same stream as the copy, just to make the copy start immediately.
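The fake-kernel workaround could look roughly like the sketch below. The post does not show the actual fake kernel, so `dummyKernel`, `copyWithDummyKernel` and the buffer size are hypothetical names and values chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel; it does no work and exists only to be launched
// right behind the copy on the same stream.
__global__ void dummyKernel() {}

// Hypothetical helper: enqueue the H2D copy, then the empty kernel,
// both on stream s. Returns the first error encountered.
cudaError_t copyWithDummyKernel(void *dst, const void *src, size_t bytes,
                                cudaStream_t s) {
    cudaError_t err = cudaMemcpyAsync(dst, src, bytes,
                                      cudaMemcpyHostToDevice, s);
    if (err != cudaSuccess) return err;
    dummyKernel<<<1, 1, 0, s>>>();
    return cudaGetLastError();
}

int main() {
    const size_t bytes = (1 << 20) * sizeof(float);  // assumed size
    float *h = nullptr;
    float *d = nullptr;
    cudaStream_t s;

    if (cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking) != cudaSuccess ||
        cudaMallocHost((void **)&h, bytes) != cudaSuccess ||  // pinned memory
        cudaMalloc((void **)&d, bytes) != cudaSuccess) {
        fprintf(stderr, "setup failed\n");
        return 1;
    }

    cudaError_t err = copyWithDummyKernel(d, h, bytes, s);
    if (err == cudaSuccess) err = cudaStreamSynchronize(s);
    if (err != cudaSuccess) {
        fprintf(stderr, "error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d);
    cudaFreeHost(h);
    cudaStreamDestroy(s);
    return 0;
}
```

In the loop, the helper would replace the bare cudaMemcpyAsync call on stream[!streamIdx].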
Is there any NORMAL way to make the GPU execute the cudaMemcpyAsync immediately?
I am using a GeForce GTX 1060 6GB GPU.