Suppose I want to perform an asynchronous host-to-device memcpy in CUDA and then immediately launch a kernel. How can I test, from inside the kernel, whether the asynchronous transfer has completed?

Stefano Borini

1 Answer

Issuing the asynchronous copy and the kernel launch into the same CUDA "stream" ensures that the kernel executes only after the asynchronous transfer has completed, because operations within a stream run in the order they were issued. The following code example demonstrates this:

#include <stdio.h>

__global__ void kernel(const int *ptr)
{
  printf("Hello, %d\n", *ptr);
}

int main()
{
  int *h_ptr = 0;

  // allocate pinned host memory with cudaMallocHost
  // pinned memory is required for asynchronous copy
  cudaMallocHost(&h_ptr, sizeof(int));

  // look for thirteen in the output
  *h_ptr = 13;

  // allocate device memory
  int *d_ptr = 0;
  cudaMalloc(&d_ptr, sizeof(int));

  // create a stream
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // sequence the asynchronous copy on our stream
  cudaMemcpyAsync(d_ptr, h_ptr, sizeof(int), cudaMemcpyHostToDevice, stream);

  // sequence the kernel on our stream after the copy
  // the kernel will execute after the copy has completed
  kernel<<<1,1,0,stream>>>(d_ptr);

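  // wait for the kernel to finish so that its printf output is flushed
  cudaStreamSynchronize(stream);

  // clean up after ourselves
  cudaStreamDestroy(stream);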
  cudaFree(d_ptr);
  cudaFreeHost(h_ptr);
}

And the output:

$ nvcc -arch=sm_20 async.cu -run
Hello, 13

I don't believe there's any supported way to test from within a kernel whether some asynchronous condition (such as the completion of an asynchronous transfer) has been met. CUDA thread blocks are assumed to execute completely independently from other threads of execution.
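
By contrast, the host can test for completion. The following is a minimal sketch, not part of the original example: it records an event after the copy and polls it with cudaEventQuery. The function name poll_until_copied and the event name copy_done are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: issues an asynchronous copy, then polls from the host
// until it has finished. Returns cudaSuccess once the copy is complete,
// or an error code otherwise.
cudaError_t poll_until_copied(int value)
{
  int *h_ptr = 0, *d_ptr = 0;
  cudaStream_t stream;
  cudaEvent_t copy_done;
  cudaError_t status;

  // pinned host memory, as in the example above
  if ((status = cudaMallocHost(&h_ptr, sizeof(int))) != cudaSuccess) return status;
  *h_ptr = value;
  if ((status = cudaMalloc(&d_ptr, sizeof(int))) != cudaSuccess) {
    cudaFreeHost(h_ptr);
    return status;
  }
  cudaStreamCreate(&stream);
  cudaEventCreate(&copy_done);

  cudaMemcpyAsync(d_ptr, h_ptr, sizeof(int), cudaMemcpyHostToDevice, stream);

  // the event completes once all preceding work in the stream,
  // including the copy, has finished
  cudaEventRecord(copy_done, stream);

  // cudaEventQuery returns cudaErrorNotReady while that work is still pending
  while ((status = cudaEventQuery(copy_done)) == cudaErrorNotReady) {
    // other host work could go here
  }

  cudaEventDestroy(copy_done);
  cudaStreamDestroy(stream);
  cudaFree(d_ptr);
  cudaFreeHost(h_ptr);
  return status;
}

int main()
{
  cudaError_t status = poll_until_copied(13);
  printf("%s\n", status == cudaSuccess ? "copy completed" : cudaGetErrorString(status));
  return status == cudaSuccess ? 0 : 1;
}
```

This only helps the host decide when to launch dependent work; it does not give a running kernel any way to observe the transfer.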

Jared Hoberock
  • OK, so this means there's no way to achieve what I want. That's bad: if I have a kernel whose first part can run without touching the memcpy area and whose second part requires it, I cannot start the kernel until the async copy has finished. This forces me to do the first part on the CPU. – Stefano Borini Aug 11 '11 at 22:18
  • Is it possible to partition the problem into a dependent and an independent portion? If so, it might be possible to split the computation into two kernel launches: one that depends on the asynchronous transfer, and one that doesn't. – Jared Hoberock Aug 11 '11 at 22:21
  • That's also a possibility. I don't have an actual problem under consideration, so I cannot answer your question. I am just studying and the question occurred to me. – Stefano Borini Aug 11 '11 at 22:42
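
The two-launch split suggested in the comments might look like the sketch below. It is only an illustration under stated assumptions: the kernels independent_part and dependent_part, the stream and event names, and the constant 29 are all hypothetical stand-ins for real work. The independent kernel runs in its own stream so that it may overlap with the copy, while the dependent kernel is queued behind the copy and additionally waits on an event recorded after the first kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical first phase: work that never touches the buffer being copied.
__global__ void independent_part(int *scratch)
{
  *scratch = 29;
}

// Hypothetical second phase: needs the copied input and the first phase's result.
__global__ void dependent_part(const int *copied, const int *scratch, int *out)
{
  *out = *copied + *scratch;
}

// Returns value + 29 on success, or -1 if pinned allocation or the final
// synchronization reports an error.
int run_split_example(int value)
{
  int *h_in = 0, *h_out = 0, *d_in = 0, *d_scratch = 0, *d_out = 0;
  cudaStream_t copy_stream, compute_stream;
  cudaEvent_t phase1_done;

  // pinned host buffers so that both copies can be asynchronous
  if (cudaMallocHost(&h_in, sizeof(int)) != cudaSuccess ||
      cudaMallocHost(&h_out, sizeof(int)) != cudaSuccess)
    return -1;
  *h_in = value;

  cudaMalloc(&d_in, sizeof(int));
  cudaMalloc(&d_scratch, sizeof(int));
  cudaMalloc(&d_out, sizeof(int));
  cudaStreamCreate(&copy_stream);
  cudaStreamCreate(&compute_stream);
  cudaEventCreate(&phase1_done);

  // the copy and the independent kernel are in different streams, so they may overlap
  cudaMemcpyAsync(d_in, h_in, sizeof(int), cudaMemcpyHostToDevice, copy_stream);
  independent_part<<<1,1,0,compute_stream>>>(d_scratch);
  cudaEventRecord(phase1_done, compute_stream);

  // the dependent kernel follows the copy in stream order and also waits for phase one
  cudaStreamWaitEvent(copy_stream, phase1_done, 0);
  dependent_part<<<1,1,0,copy_stream>>>(d_in, d_scratch, d_out);
  cudaMemcpyAsync(h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost, copy_stream);

  // errors from earlier asynchronous work surface at this synchronization point
  cudaError_t err = cudaStreamSynchronize(copy_stream);
  int result = (err == cudaSuccess) ? *h_out : -1;

  cudaEventDestroy(phase1_done);
  cudaStreamDestroy(compute_stream);
  cudaStreamDestroy(copy_stream);
  cudaFree(d_out);
  cudaFree(d_scratch);
  cudaFree(d_in);
  cudaFreeHost(h_out);
  cudaFreeHost(h_in);
  return result;
}

int main()
{
  // 13 + 29, so this should print "result = 42" on a working device
  printf("result = %d\n", run_split_example(13));
  return 0;
}
```

Whether the copy and the first kernel actually overlap depends on the device and driver; the stream and event ordering only guarantees that the dependent kernel sees both the copied data and the first kernel's result.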