#include <cstdio>

__global__ void helloCUDA(float f)
{
    printf("Hello thread %d, f=%f\n", threadIdx.x, f);
}

int main()
{
    helloCUDA<<<1, 5>>>(1.2345f);
    cudaDeviceSynchronize();
    return 0;
}

Why is cudaDeviceSynchronize() required after the kernel call here, when in many other places (for example, the code I linked) it is not?

therealrootuser
gpuguy

1 Answer


A kernel launch is asynchronous. This means it returns control to the CPU thread immediately after starting up the GPU process, before the kernel has finished executing.

So what is the next thing in the CPU thread here? Application exit.

At application exit, the application's ability to send output to standard output is terminated by the OS.

Thus the output that is generated later by the kernel has nowhere to go, and you won't see it.

On the other hand, if you use cudaDeviceSynchronize(), then the kernel is guaranteed to finish (and the output from the kernel will find a waiting standard output queue), before the application is allowed to exit.
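As a rough sketch of the pattern described above, the launch and the wait can be wrapped together with basic error checking. The helper name `launchAndWait` is made up for illustration and is not from the original post; the CUDA runtime calls themselves (`cudaGetLastError`, `cudaDeviceSynchronize`, `cudaGetErrorString`) are standard.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void helloCUDA(float f)
{
    printf("Hello thread %d, f=%f\n", threadIdx.x, f);
}

// Hypothetical helper (not from the original post): launches the kernel,
// then blocks the host thread until the kernel has finished.
cudaError_t launchAndWait()
{
    helloCUDA<<<1, 5>>>(1.2345f);          // asynchronous: returns immediately
    cudaError_t err = cudaGetLastError();  // reports launch-configuration errors
    if (err != cudaSuccess) return err;
    // Blocks until all previously issued device work is complete, so the
    // kernel's printf output is delivered before the caller continues.
    return cudaDeviceSynchronize();
}

int main()
{
    cudaError_t err = launchAndWait();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;  // safe to exit: the kernel has already completed
}
```

Removing the `cudaDeviceSynchronize()` call from the helper should reproduce the behavior in the question: `main` returns before the kernel's output is delivered.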

Robert Crovella
  • But in the link I posted, cudaMemcpy(.... device to host) is called immediately after the kernel call. Why is there no cudaDeviceSynchronize() there? – gpuguy Oct 05 '13 at 03:13
  • Because CUDA operations (API calls, kernel calls) issued to the same stream, even if asynchronous, are guaranteed to execute serially. Since the kernel and the cudaMemcpy operation are in the same (default) stream, the cudaMemcpy is guaranteed not to begin until the kernel has completed, even though the kernel launch is asynchronous (with respect to the host thread). – Robert Crovella Oct 05 '13 at 03:29
  • So if there is no cudaMemcpy, the program would terminate? – KansaiRobot Jun 14 '21 at 10:30
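The stream-ordering point from the comments can be sketched as follows. The kernel `fillDoubled` and the helper `runAndCopy` are hypothetical names invented for this example, not code from the linked post; the sketch assumes everything is issued to the default stream.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: each thread stores twice its index.
__global__ void fillDoubled(int *out)
{
    out[threadIdx.x] = 2 * threadIdx.x;
}

// Hypothetical helper: runs the kernel and copies the result to the host.
// Returns true if every CUDA call succeeded.
bool runAndCopy(int *host, int n)
{
    int *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) return false;

    fillDoubled<<<1, n>>>(dev);  // asynchronous launch in the default stream

    // No cudaDeviceSynchronize() here. This copy is issued to the same
    // (default) stream, so it does not begin until the kernel has completed,
    // and it does not return until the data has arrived in host memory.
    cudaError_t err = cudaMemcpy(host, dev, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess;
}

int main()
{
    const int n = 5;
    int host[n] = {0};
    if (!runAndCopy(host, n)) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < n; ++i)
        printf("host[%d] = %d\n", i, host[i]);
    return 0;
}
```

Here the blocking `cudaMemcpy` plays the role that `cudaDeviceSynchronize()` plays in the question's code: the host thread cannot reach `return 0` until the kernel's work is done.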