
I am trying to use streams to run an H2D copy and kernel execution in parallel. To do this, I created two streams with the cudaStreamNonBlocking flag. In a loop I do the following (pseudocode):

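// pseudocode: size, grid and block are placeholders
cudaMemcpy(dData[0], hData[0], size, cudaMemcpyHostToDevice);
streamIdx = 0;

while (1)
{
    // prepare for the next iteration: prefetch into the other buffer
    cudaMemcpyAsync(dData[!streamIdx], hData[!streamIdx], size,
        cudaMemcpyHostToDevice, stream[!streamIdx]);
    // run the current iteration once its copy has finished
    cudaStreamSynchronize(stream[streamIdx]);
    kernel1<<<grid, block, 0, stream[streamIdx]>>>();
    kernel2<<<grid, block, 0, stream[streamIdx]>>>();
    streamIdx = !streamIdx;
}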

The host memory is pinned. The result is that every second cudaMemcpyAsync is delayed, which makes the code run slower. See diagram: [image]

I managed to trick the copy engine by launching a fake kernel on the same stream as the copy, just to make the copy start immediately.

Is there any NORMAL way to make the GPU execute the cudaMemcpyAsync immediately?

I use a GeForce GTX 1060 6GB GPU.

einpoklum
dmitryvolk
  • This may be WDDM command batching. Are you on Windows? If so, there is no *normal* way to avoid it. Abnormal ways include restructuring your code and/or using additional, perhaps unnecessary, commands, such as an additional kernel launch or an additional CUDA API call like cudaStreamQuery, to get the command queue to flush when you want it to. Or you could switch to Linux, or you could switch to a GPU that can be placed into TCC mode on Windows (the GeForce GPUs cannot, excepting Titans). – Robert Crovella Nov 07 '16 at 14:48
  • I do use Windows. cudaStreamQuery worked to flush the queue. Thanks – dmitryvolk Nov 09 '16 at 12:37
  • How was the host-side memory allocated? – einpoklum Oct 26 '17 at 23:52
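The cudaStreamQuery workaround mentioned in the comments could look roughly like the sketch below. It keeps the double-buffered loop from the question and adds one cudaStreamQuery call right after the prefetch copy, which on Windows WDDM reportedly nudges the driver into submitting its batched commands. The kernel `scale`, the function `runPipeline`, the buffer size, and the bounded iteration count are all assumptions made for illustration; they are not from the original post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel1/kernel2 from the question.
__global__ void scale(float* d, int n, float f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

// Runs `iters` double-buffered iterations. Returns 0 on success, 1 on a CUDA error.
int runPipeline(int n, int iters)
{
    const size_t bytes = n * sizeof(float);
    const int threads = 256, blocks = (n + threads - 1) / threads;
    cudaStream_t stream[2];
    float *hData[2], *dData[2];

    for (int s = 0; s < 2; ++s) {
        cudaError_t e = cudaStreamCreateWithFlags(&stream[s], cudaStreamNonBlocking);
        if (e == cudaSuccess) e = cudaMallocHost(&hData[s], bytes);  // pinned host memory
        if (e == cudaSuccess) e = cudaMalloc(&dData[s], bytes);
        if (e != cudaSuccess) return 1;
        for (int i = 0; i < n; ++i) hData[s][i] = 1.0f;
    }

    int cur = 0;
    cudaMemcpyAsync(dData[cur], hData[cur], bytes, cudaMemcpyHostToDevice, stream[cur]);

    for (int it = 0; it < iters; ++it) {
        int next = !cur;
        // Prefetch the input for the next iteration on the other stream.
        cudaMemcpyAsync(dData[next], hData[next], bytes,
                        cudaMemcpyHostToDevice, stream[next]);
        // Non-blocking status check; used here only for its side effect of
        // flushing the WDDM command queue. cudaErrorNotReady is expected.
        cudaStreamQuery(stream[next]);

        // Wait for the current buffer's copy, then process it.
        cudaStreamSynchronize(stream[cur]);
        scale<<<blocks, threads, 0, stream[cur]>>>(dData[cur], n, 2.0f);
        cur = next;
    }

    cudaError_t err = cudaDeviceSynchronize();
    for (int s = 0; s < 2; ++s) {
        cudaFree(dData[s]);
        cudaFreeHost(hData[s]);
        cudaStreamDestroy(stream[s]);
    }
    return err == cudaSuccess ? 0 : 1;
}

int main()
{
    int rc = runPipeline(1 << 20, 8);
    printf("pipeline %s\n", rc == 0 ? "finished" : "failed");
    return rc;
}
```

Whether the extra call changes anything depends on the driver model; on Linux or in TCC mode it should be a harmless no-op, so a profiler timeline is the way to confirm the effect.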

1 Answer


cudaStreamSynchronize forces the CPU to block until the stream is idle. In your case the CPU has no need to block; it can simply keep feeding the GPU, because operations issued to the same stream already execute in order.

Restructure your code like this:

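// pseudocode: size, grid and block are placeholders
while (1)
{
    // copy this iteration's input on its own stream
    cudaMemcpyAsync(dData[streamIdx], hData[streamIdx], size,
        cudaMemcpyHostToDevice, stream[streamIdx]);
    // kernels queued on the same stream start only after the copy completes
    kernel1<<<grid, block, 0, stream[streamIdx]>>>();
    kernel2<<<grid, block, 0, stream[streamIdx]>>>();
    streamIdx = !streamIdx;
}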
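As a rough, self-contained sketch of the restructured loop above: each iteration queues its copy and both kernels on the same stream, and the two streams alternate, so the host never blocks inside the loop. The kernel `addOne`, the function `runWithoutHostSync`, the separate output buffers, and the bounded iteration count are hypothetical additions so that the result can be checked. The sketch also assumes the host input buffers are not modified while copies are pending.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel1/kernel2.
__global__ void addOne(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Returns the number of wrong output elements, or -1 on a CUDA error.
int runWithoutHostSync(int n, int iters)
{
    const size_t bytes = n * sizeof(float);
    const int threads = 256, blocks = (n + threads - 1) / threads;
    cudaStream_t stream[2];
    float *hIn[2], *hOut[2], *dData[2];

    for (int s = 0; s < 2; ++s) {
        cudaError_t e = cudaStreamCreateWithFlags(&stream[s], cudaStreamNonBlocking);
        if (e == cudaSuccess) e = cudaMallocHost(&hIn[s], bytes);   // pinned input
        if (e == cudaSuccess) e = cudaMallocHost(&hOut[s], bytes);  // pinned output
        if (e == cudaSuccess) e = cudaMalloc(&dData[s], bytes);
        if (e != cudaSuccess) return -1;
        for (int i = 0; i < n; ++i) hIn[s][i] = (float)s;
    }

    for (int it = 0; it < iters; ++it) {
        int s = it & 1;  // alternate between the two streams
        // The copy and both kernels are ordered within stream s, so no host
        // synchronization is needed; work on the other stream may overlap.
        cudaMemcpyAsync(dData[s], hIn[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        addOne<<<blocks, threads, 0, stream[s]>>>(dData[s], n);
        addOne<<<blocks, threads, 0, stream[s]>>>(dData[s], n);
    }

    // Copy back only the buffers that were actually processed.
    int used = iters < 2 ? iters : 2;
    for (int s = 0; s < used; ++s)
        cudaMemcpyAsync(hOut[s], dData[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    // Every pass re-copies the input, so each element should equal input + 2.
    int bad = 0;
    for (int s = 0; s < used; ++s)
        for (int i = 0; i < n; ++i)
            if (hOut[s][i] != (float)s + 2.0f) ++bad;

    for (int s = 0; s < 2; ++s) {
        cudaFree(dData[s]);
        cudaFreeHost(hIn[s]);
        cudaFreeHost(hOut[s]);
        cudaStreamDestroy(stream[s]);
    }
    return bad;
}

int main()
{
    printf("mismatches: %d\n", runWithoutHostSync(1 << 20, 6));
    return 0;
}
```

On Windows with the WDDM driver model, the command batching described in the comments may still delay the queued copies, so this restructuring and the cudaStreamQuery flush are not mutually exclusive.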
zfz