Is it necessary to call cudaDeviceSynchronize() after a CUB routine is invoked from a CUDA kernel? I had assumed that when, say, DeviceReduce::Sum() is called from the device, the implicit memory operations would block the device from moving on, but I have been experiencing instability with the following code, which is called on the GPU:

__device__ void calcMonomerFlux(double* fluxes, double* lengths, double* dt) //temp2 temp1
{

    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;

    arrInitToLengths<<< numBlocks, numThreads >>>(lengths); 
    cudaDeviceSynchronize();
    arrMult<<< numBlocks, numThreads >>>(fluxes, lengths, lengths);
    cudaDeviceSynchronize();
    double sum = 0;

    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, lengths, lengths, maxlength);
    //cudaDeviceSynchronize();

    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    //cudaDeviceSynchronize();

    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, lengths, lengths, maxlength);
    //cudaDeviceSynchronize();

    cudaFree(d_temp_storage);

}

1 Answer

Yes, cudaDeviceSynchronize() is required after each CUB call. Note where the commented-out synchronization calls are in the question. I spent many hours tracking down why my sums were being calculated incorrectly, or even inconsistently. Eventually, while stepping through the NSIGHT debugger, I found that the calculations were correct only when I placed a breakpoint after each CUB function.
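A corrected version of the function from the question might look like the sketch below, with a synchronization after each CUB call (and after the device-side cudaMalloc). This assumes, as in the question, that `numBlocks`, `numThreads`, `maxlength`, and the helper kernels `arrInitToLengths` and `arrMult` are defined elsewhere:

```cuda
__device__ void calcMonomerFlux(double* fluxes, double* lengths, double* dt)
{
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;

    arrInitToLengths<<< numBlocks, numThreads >>>(lengths);
    cudaDeviceSynchronize();
    arrMult<<< numBlocks, numThreads >>>(fluxes, lengths, lengths);
    cudaDeviceSynchronize();

    // First call only queries the required temporary storage size.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, lengths, lengths, maxlength);
    cudaDeviceSynchronize();

    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cudaDeviceSynchronize();

    // Second call performs the reduction. The child kernels it launches are
    // asynchronous, so wait for them before any code consumes the result.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, lengths, lengths, maxlength);
    cudaDeviceSynchronize();

    cudaFree(d_temp_storage);
}
```

Note that launching kernels from device code requires compiling with relocatable device code (-rdc=true) on a device that supports dynamic parallelism, and that in recent CUDA toolkits the device-side form of cudaDeviceSynchronize() has been deprecated and later removed, so this pattern applies to the toolkit versions contemporary with this question.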

  • CUB, when called from device code, is using CUDA Dynamic Parallelism, i.e. it is spawning child kernels. Like any kernel launch, these kernel launches are asynchronous to the calling thread, and are not guaranteed to be complete when control is returned to the calling thread. Therefore if the calling thread requires that the data produced by the CUB call be complete and ready, it is necessary to synchronize and wait for the child kernel(s) to complete. This idea is true in general for data produced by child kernels via CUDA dynamic parallelism. – Robert Crovella Aug 28 '14 at 12:22
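The general point about child-kernel asynchrony can be shown with a minimal sketch that does not involve CUB at all (hypothetical `child`/`parent` kernels, compiled with -rdc=true on a device supporting dynamic parallelism):

```cuda
#include <cstdio>

__global__ void child(int *out)
{
    *out = 42;  // the child kernel produces a result
}

__global__ void parent(int *out)
{
    child<<<1, 1>>>(out);     // asynchronous launch from the device
    // Without a synchronization here, *out may still hold its old value
    // when the parent thread reads it.
    cudaDeviceSynchronize();  // wait for the child kernel to complete
    printf("child produced %d\n", *out);  // now safe to consume
}
```

A device-side CUB call behaves like the `child<<<...>>>` launch above: control returns to the calling thread before the spawned work is guaranteed complete, so any consumer of the result must synchronize first.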