Suppose I want to perform an asynchronous host-to-device memcpy in CUDA and then immediately launch a kernel. How can I test, from within the kernel, whether the asynchronous transfer has completed?
1 Answer
Issuing the asynchronous copy and the kernel launch on the same CUDA "stream" ensures that the kernel executes only after the transfer has completed, because operations within a stream run in the order they were issued. The following example demonstrates this:
#include <stdio.h>

__global__ void kernel(const int *ptr)
{
    printf("Hello, %d\n", *ptr);
}

int main()
{
    int *h_ptr = 0;
    // allocate pinned host memory with cudaMallocHost
    // pinned memory is required for asynchronous copy
    cudaMallocHost(&h_ptr, sizeof(int));
    // look for thirteen in the output
    *h_ptr = 13;
    // allocate device memory
    int *d_ptr = 0;
    cudaMalloc(&d_ptr, sizeof(int));
    // create a stream
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // sequence the asynchronous copy on our stream
    cudaMemcpyAsync(d_ptr, h_ptr, sizeof(int), cudaMemcpyHostToDevice, stream);
    // sequence the kernel on our stream after the copy
    // the kernel will execute after the copy has completed
    kernel<<<1,1,0,stream>>>(d_ptr);
    // wait for the stream to drain; this also flushes the device-side printf buffer
    cudaStreamSynchronize(stream);
    // clean up after ourselves
    cudaStreamDestroy(stream);
    cudaFree(d_ptr);
    cudaFreeHost(h_ptr);
    return 0;
}
And the output:
$ nvcc -arch=sm_20 async.cu -run
Hello, 13
I don't believe there's any supported way to test from within a kernel whether some asynchronous condition (such as the completion of an asynchronous transfer) has been met. CUDA thread blocks are assumed to execute completely independently from other threads of execution.

Jared Hoberock
- Ok, so this means that there's no way to achieve what I want. That's bad, because if I have a kernel where a first part can be executed without accessing the memcpy area, and a second part that requires it, I cannot start the kernel until the async copy is performed. This forces me to do the first part on the CPU. – Stefano Borini Aug 11 '11 at 22:18
- Is it possible to partition the problem into a dependent and independent portion? If so, it might be possible to split the computation into two kernel launches -- one that depends on the asynchronous transfer, and one which doesn't. – Jared Hoberock Aug 11 '11 at 22:21
- That's also a possibility. I don't have any actual problem under consideration, so I cannot answer your question. I am just studying and I got that question. – Stefano Borini Aug 11 '11 at 22:42
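The two-launch split suggested in the comments could be sketched roughly as follows. This is only an illustration under assumptions, not code from the answer: the kernels `independent_part` and `dependent_part`, the helper `run_split_example`, and the values 29 and 13 are all hypothetical. The independent kernel runs on one stream and may overlap with the copy on a second stream; the dependent kernel is queued behind the copy and additionally waits on an event recorded after the independent kernel. Error checking is abbreviated.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical "first part": does not touch the memcpy area.
__global__ void independent_part(int *scratch)
{
    *scratch = 29;
}

// Hypothetical "second part": needs the value delivered by the async copy.
__global__ void dependent_part(const int *scratch, const int *copied, int *out)
{
    *out = *scratch + *copied;
}

// Runs both parts and returns the combined result, or -1 on a CUDA error.
int run_split_example()
{
    int *h_in = 0, *h_out = 0;
    int *d_in = 0, *d_scratch = 0, *d_out = 0;

    // pinned host memory, so the copies can be asynchronous
    if (cudaMallocHost(&h_in, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMallocHost(&h_out, sizeof(int)) != cudaSuccess) {
        cudaFreeHost(h_in);
        return -1;
    }
    *h_in = 13;
    *h_out = 0;

    cudaMalloc(&d_in, sizeof(int));
    cudaMalloc(&d_scratch, sizeof(int));
    cudaMalloc(&d_out, sizeof(int));

    cudaStream_t compute_stream, copy_stream;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&copy_stream);
    cudaEvent_t independent_done;
    cudaEventCreate(&independent_done);

    // first stream: work that does not need the transferred data
    independent_part<<<1,1,0,compute_stream>>>(d_scratch);
    cudaEventRecord(independent_done, compute_stream);

    // second stream: the copy, which may overlap with the kernel above
    cudaMemcpyAsync(d_in, h_in, sizeof(int), cudaMemcpyHostToDevice, copy_stream);

    // the dependent kernel is ordered after the copy (same stream)
    // and after the independent kernel (via the event)
    cudaStreamWaitEvent(copy_stream, independent_done, 0);
    dependent_part<<<1,1,0,copy_stream>>>(d_scratch, d_in, d_out);
    cudaMemcpyAsync(h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost, copy_stream);

    // wait for everything queued on the second stream to finish
    cudaError_t err = cudaStreamSynchronize(copy_stream);
    if (err == cudaSuccess) err = cudaGetLastError();
    int result = (err == cudaSuccess) ? *h_out : -1;

    // clean up after ourselves
    cudaEventDestroy(independent_done);
    cudaStreamDestroy(compute_stream);
    cudaStreamDestroy(copy_stream);
    cudaFree(d_in);
    cudaFree(d_scratch);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return result;
}

int main()
{
    printf("Result: %d\n", run_split_example());
    return 0;
}
```

On a working device this should print `Result: 42`. Whether the independent kernel and the copy actually overlap depends on the hardware; the streams and the event only guarantee the ordering, so no in-kernel completion test is needed.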