
I'm using CUDA 4.2 with a Quadro NVS 295 on a Win7 x64 machine. In the CUDA C Programming Guide I read this:

"...Streams are released by calling cudaStreamDestroy().

for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);

cudaStreamDestroy() waits for all preceding commands in the given stream to complete before destroying the stream and returning control to the host thread."

Is this really true? I wrote a small test where I do more or less the following (I'll show only pseudocode):

//transfer input buffer to device
cudaMemcpyToArrayAsync( ... , stream[1]);

//launch kernel
my_kernel <<<dimGrid, dimBlock, 0, stream[1]>>> (...);

//transfer from device to host
cudaMemcpyAsync(.., cudaMemcpyDeviceToHost, stream[1]);

//Destroy stream. In theory this should block the host until everything on the stream is completed!
ret = cudaStreamDestroy(stream[1]); 

With this example, it seems that the cudaStreamDestroy() call returns to the host immediately, i.e. without waiting for the cudaMemcpyAsync() call and the other stream instructions to finish. If I put a cudaStreamSynchronize(stream[1]); call before destroying the stream, everything works, but more slowly. So, what am I doing wrong?
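
For completeness, the variant that works for me (but is slower) looks roughly like this:

//wait for all preceding commands on the stream to complete
cudaStreamSynchronize(stream[1]);

//only then destroy the stream
ret = cudaStreamDestroy(stream[1]);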

Thank you very much for your responses!

ACRay

2 Answers


I am not sure what version of the documentation you are looking at, but it isn't the same as mine. My CUDA 4.2 documentation says this:

Destroys and cleans up the asynchronous stream specified by stream.

In case the device is still doing work in the stream stream when cudaStreamDestroy() is called, the function will return immediately and the resources associated with stream will be released automatically once the device has completed all work in stream.

And, in my experience, that is exactly what it does. The driver waits until the stream is empty and destroys it. But cudaStreamDestroy doesn't block the calling thread.

You can confirm this by running this example:

#include <stdio.h>
#include <assert.h>
#include <unistd.h>

__global__ void kernel(int * inout, const int N)
{
    int gid = threadIdx.x + blockIdx.x * blockDim.x;
    int gstride = gridDim.x * blockDim.x;

    for (; gid < N; gid += gstride) inout[gid] *= 2;
}

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

int main(void)
{
    const int N = 2<<20, sz = N * sizeof(int);

    int * inputs, * outputs, * _inout;

    gpuErrchk( cudaMallocHost((void **)&inputs, sz) );
    gpuErrchk( cudaMallocHost((void **)&outputs, sz) );
    gpuErrchk( cudaMalloc((void **)&_inout, sz) );

    for(int i=0; i<N; i++) { inputs[i] = i; outputs[i] = 0; }

    cudaStream_t stream[2];
    for (int i = 0; i < 2; i++)
        gpuErrchk( cudaStreamCreate(&stream[i]) );

    gpuErrchk( cudaMemcpyAsync(_inout, inputs, sz, cudaMemcpyHostToDevice, stream[1]) );

    kernel<<<128, 128, 0, stream[1]>>>(_inout, N);
    gpuErrchk(cudaPeekAtLastError());

    gpuErrchk( cudaMemcpyAsync(outputs, _inout, sz, cudaMemcpyDeviceToHost, stream[1]) );

    for(int i = 0; i < 2; i++)
        gpuErrchk( cudaStreamDestroy(stream[i]) );

    sleep(1); // remove the sleep and see what happens....

    for(int i = 0; i < N; i++)
        assert( (2 * inputs[i]) == outputs[i] );

    cudaDeviceReset();

    return 0;
}

Without the sleep() the code will fail, because the GPU hasn't finished the copy back to the host when the results are checked; with it, the assert will pass. Note that the sleep is doing something subtly different from using an explicit stream synchronization primitive before the cudaStreamDestroy calls, even if the result is the same here. If destroying a stream that still had work in flight discarded that work, the result check could never pass, however long the host waited.
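
For a timing-independent check, one possible variant (a sketch, not part of the original answer) is to replace the destroy-then-sleep sequence with an explicit synchronization on stream[1] before the destroys:

// instead of sleep(1): block the host until everything queued on stream[1] has finished
gpuErrchk( cudaStreamSynchronize(stream[1]) );

for(int i = 0; i < 2; i++)
    gpuErrchk( cudaStreamDestroy(stream[i]) );

With this change the assert passes regardless of how fast the GPU happens to be.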

talonmies
  • Thank you very much! I was reading exactly the same documentation (4.2) but probably a different chapter/section. I mean, in that section it was not very clear and I think it was a bit misleading. Thank you again! – ACRay Jun 11 '12 at 20:25

A CUDA stream is just an execution queue for device tasks. All functions that accept a stream only add a new task to the queue, without waiting for its execution to finish. cudaStreamDestroy() enqueues a special task meaning that the stream should be destroyed once all previous device tasks have completed. The words

"cudaStreamDestroy() waits for all preceding commands in the given stream to complete before destroying the stream and returning control to the host thread."

mean that the stream cannot be destroyed until your device code has completed.
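
To make the queue behaviour concrete, a minimal standalone sketch (not part of the original answer) could look like this: an async copy is enqueued, cudaStreamQuery() is used to observe that it is still pending, and cudaStreamSynchronize() then blocks the host until the queue has drained:

#include <stdio.h>

int main(void)
{
    const int nbytes = 1 << 24;              // 16 MB, large enough that the copy takes a moment
    char *h, *d;
    cudaMallocHost((void **)&h, nbytes);     // pinned host memory so the copy can be truly asynchronous
    cudaMalloc((void **)&d, nbytes);

    cudaStream_t s;
    cudaStreamCreate(&s);

    // enqueue an async copy; this call returns to the host immediately
    cudaMemcpyAsync(d, h, nbytes, cudaMemcpyHostToDevice, s);

    // while the copy is still in flight, cudaStreamQuery() reports cudaErrorNotReady
    if (cudaStreamQuery(s) == cudaErrorNotReady)
        printf("work still queued on the stream\n");

    // block the host until every task queued on the stream has completed
    cudaStreamSynchronize(s);

    // the stream is now empty, so destroying it releases its resources straight away
    cudaStreamDestroy(s);

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}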

geek