
In CUDA, how is stream 0 related to other streams? Does stream 0 (default stream) execute concurrently with other streams in a context or not?

Consider the following example:

cudaMemcpy(Dst, Src, sizeof(float) * datasize, cudaMemcpyHostToDevice); // stream 0

cudaStream_t stream1;
/* ...creating stream1... */

somekernel<<<blocks, threads, 0, stream1>>>(Dst); // stream 1

In the above code, is somekernel guaranteed to always launch AFTER cudaMemcpy finishes, or can somekernel execute concurrently with cudaMemcpy?

einpoklum
user2188453

1 Answer


The cudaMemcpy call is (in all but a particular case) a synchronous call. The host thread running that code blocks until the memory transfer has completed. It cannot proceed to launch the kernel until the cudaMemcpy call has returned, and that doesn't happen until the copy operation is completed.
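As a rough illustration of that ordering, here is a minimal sketch of the question's pattern. The kernel body, the helper name copy_then_launch, the value 41.0f, and the use of cudaStreamCreate are assumptions for the example, not taken from the original code. The launch line is only reached once the blocking cudaMemcpy has returned, so the kernel should see fully copied data.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the question's somekernel: adds 1 to each element.
__global__ void somekernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: returns true if the kernel operated on fully copied data.
bool copy_then_launch(int n) {
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 41.0f);

    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    cudaStream_t stream1;
    if (cudaStreamCreate(&stream1) != cudaSuccess) {
        cudaFree(dev);
        return false;
    }

    // Issued to the default stream; the host thread blocks in this call.
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    // The host only reaches this launch after cudaMemcpy has returned.
    somekernel<<<(n + 255) / 256, 256, 0, stream1>>>(dev, n);

    // Wait for the kernel in stream1, then read the result back.
    bool ok = cudaStreamSynchronize(stream1) == cudaSuccess;
    ok = ok && cudaMemcpy(host.data(), dev, bytes,
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) ok = (host[i] == 42.0f);

    cudaStreamDestroy(stream1);
    cudaFree(dev);
    return ok;
}
```

Calling copy_then_launch(1 << 20) from main should return true on a working device. The check confirms that the ordering held for this run; it does not by itself prove the guarantee.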

More generally, the default stream (0 or null) implicitly serializes operations on the GPU whenever an operation is active in that stream. If you create streams and push operations into them while an operation is being performed in the default stream, all concurrency in those streams is lost until the default stream is idle.
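If you want to avoid relying on that implicit serialization, one option is to keep all work out of the default stream and order it explicitly. The sketch below assumes two streams created with cudaStreamNonBlocking (which do not synchronize with the legacy default stream), pinned host memory, and an event to make the kernel wait for the copy. The names add_one, copy_stream, compute_stream, copied and ordered_with_event are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to each element.
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: copy in one stream, compute in another, ordered by an event.
bool ordered_with_event(int n) {
    const size_t bytes = n * sizeof(float);
    float *pinned = nullptr, *dev = nullptr;

    // Pinned host memory lets cudaMemcpyAsync run asynchronously.
    if (cudaMallocHost(&pinned, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        cudaFreeHost(pinned);
        return false;
    }
    for (int i = 0; i < n; ++i) pinned[i] = 1.0f;

    // Non-blocking streams do not synchronize with the legacy default stream.
    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreateWithFlags(&copy_stream, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&compute_stream, cudaStreamNonBlocking);

    cudaEvent_t copied;
    cudaEventCreateWithFlags(&copied, cudaEventDisableTiming);

    // Enqueue the upload, then mark its completion point with an event.
    cudaMemcpyAsync(dev, pinned, bytes, cudaMemcpyHostToDevice, copy_stream);
    cudaEventRecord(copied, copy_stream);

    // Work enqueued in compute_stream after this call waits for the event.
    cudaStreamWaitEvent(compute_stream, copied, 0);
    add_one<<<(n + 255) / 256, 256, 0, compute_stream>>>(dev, n);

    // Download in the same stream as the kernel, so it runs after it.
    cudaMemcpyAsync(pinned, dev, bytes, cudaMemcpyDeviceToHost, compute_stream);
    bool ok = cudaStreamSynchronize(compute_stream) == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) ok = (pinned[i] == 2.0f);

    cudaEventDestroy(copied);
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
    cudaFree(dev);
    cudaFreeHost(pinned);
    return ok;
}
```

Without the cudaStreamWaitEvent call, nothing would order the kernel after the upload, because neither stream touches the default stream. The event is what replaces the implicit join here.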

talonmies
    It's important to note that NVIDIA changed the definition of the NULL (0) stream for the device runtime. For the host runtime, the NULL stream forces joins on all outstanding streams, so it is generally only useful for CPU/GPU concurrency. For the device runtime, the NULL stream is "its own" stream; other streams can run concurrently with it, and synchronization must be enforced explicitly with cudaStreamWaitEvent(). – ArchaeaSoftware Aug 27 '13 at 20:56