I'm using CUDA 4.2 on a Quadro NVS 295 on a Windows 7 x64 machine. In the CUDA C Programming Guide I read this:
"...Streams are released by calling cudaStreamDestroy().
for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);
cudaStreamDestroy() waits for all preceding commands in the given stream to complete before destroying the stream and returning control to the host thread."
Is this really true? I wrote a small program that does more or less the following (I'll post only pseudocode):
// Transfer the input buffer to the device
cudaMemcpyToArrayAsync(..., stream[1]);
// Launch the kernel
my_kernel<<<dimGrid, dimBlock, 0, stream[1]>>>(...);
// Transfer the result from device to host
cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[1]);
// Destroy the stream. In theory this should block the host until everything in the stream has completed!
ret = cudaStreamDestroy(stream[1]);
With this example, it seems that the cudaStreamDestroy() call returns to the host immediately, i.e. without waiting for the cudaMemcpyAsync() call and the other commands in the stream to finish. If I put a "cudaStreamSynchronize(stream[1]);" call before destroying the stream, everything works correctly, but more slowly. So, what am I doing wrong?
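A minimal, self-contained way to check this might look like the sketch below. It is not my actual code: busy_kernel, the buffer size and the iteration count are placeholders made up for illustration, and a plain linear buffer is used instead of a CUDA array. The idea is to record an event at the end of the stream, destroy the stream, and then query the event; if the query reports cudaErrorNotReady, cudaStreamDestroy() returned before the queued work had finished. What it prints will likely depend on the CUDA version and the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for my_kernel: does enough arithmetic to take a
// noticeable amount of time, but stays well below the Windows watchdog limit.
__global__ void busy_kernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

int main()
{
    const int n = 1 << 18;                 // assumed size, adjust as needed
    const size_t bytes = n * sizeof(float);
    float *d_buf = NULL;
    float *h_buf = NULL;
    cudaStream_t stream;
    cudaEvent_t done;

    cudaMalloc((void **)&d_buf, bytes);
    cudaMallocHost((void **)&h_buf, bytes); // pinned memory, needed for a truly asynchronous copy
    cudaMemset(d_buf, 0, bytes);
    cudaStreamCreate(&stream);
    cudaEventCreate(&done);

    // Queue the kernel, the copy back to the host, and a marker event.
    busy_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n, 1000);
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);

    // Destroy the stream without synchronizing first.
    cudaError_t ret = cudaStreamDestroy(stream);

    // cudaSuccess       -> the queued work had already completed
    // cudaErrorNotReady -> cudaStreamDestroy() returned while work was pending
    cudaError_t state = cudaEventQuery(done);

    printf("cudaStreamDestroy: %s\n", cudaGetErrorString(ret));
    printf("work after destroy: %s\n",
           state == cudaSuccess ? "complete" :
           state == cudaErrorNotReady ? "still pending" :
           cudaGetErrorString(state));

    // Wait for the marker before touching h_buf or freeing anything.
    cudaEventSynchronize(done);
    cudaEventDestroy(done);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```

Checking the event instead of timing the call avoids depending on any particular host timer, and cudaEventSynchronize() on that event is also a way to wait for just this stream's work before reading the host buffer.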
Thank you very much for your responses!