Let's take the following code, where there is a parent and a child kernel. From the parent kernel we wish to launch threadIdx.x
child kernels in different streams to maximize parallel throughput. We then wait for those children with cudaDeviceSynchronize(),
as the parent kernel needs to see the changes made to global memory.
Now let's say we also wish to launch n
parent kernels in streams, and between each set of n
parent kernels we wish to run in parallel, we must also wait for results using cudaDeviceSynchronize().
How would this behave?
From this official introduction to Dynamic Parallelism by NVIDIA, I would think that parent kernel[0]
would only wait for the streams started within it. Is this correct? If not, what happens?
NOTE: I am aware that only so many kernels can run concurrently (32 in my case), but this is more about maximizing occupancy.
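As a side note, since occupancy is the goal: the device runtime has configurable limits that constrain this launch pattern, and they can be raised from the host before the first launch. This is only a sketch; the limit names are the real CUDA device-runtime limits, but the values are placeholders I chose for illustration:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    // Buffer for device-side launches that have not yet begun execution.
    // If the nested launches exceed the default, device-side kernel
    // launches start failing, so it may need to be raised. The value
    // here is an illustrative guess, not a recommendation.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

    // Maximum nesting depth at which a device-side cudaDeviceSynchronize()
    // is allowed; relevant here because parent_kernel synchronizes on
    // its children.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 2);

    return 0;
}
```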
EDIT: a little code sample:
__global__ void child_kernel(void) { }

__global__ void parent_kernel(void)
{
    if (blockIdx.x == 0)
    {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        child_kernel<<<1, 10, 0, s>>>();
        cudaStreamDestroy(s);
    }
    cudaDeviceSynchronize();
}
for (int i = 0; i < 10; i++)
{
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    parent_kernel<<<10, 10, 0, s>>>();
    cudaStreamDestroy(s);
}
cudaDeviceSynchronize();
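For anyone who wants to reproduce this, here is a self-contained version of the sample above with launch-error checking added. The CHECK macro is my own addition for illustration, not part of the original code; dynamic parallelism requires compiling with relocatable device code on a device of compute capability 3.5 or higher (e.g. nvcc -rdc=true -arch=sm_35):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Host-side error check, added for illustration only.
#define CHECK(call) do {                                        \
    cudaError_t err = (call);                                   \
    if (err != cudaSuccess)                                     \
        fprintf(stderr, "CUDA error '%s' at line %d\n",         \
                cudaGetErrorString(err), __LINE__);             \
} while (0)

__global__ void child_kernel(void) { }

__global__ void parent_kernel(void)
{
    if (blockIdx.x == 0)
    {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        child_kernel<<<1, 10, 0, s>>>();
        cudaStreamDestroy(s);  // deferred until the child completes
    }
    // Wait for the child kernels launched above before returning.
    cudaDeviceSynchronize();
}

int main(void)
{
    for (int i = 0; i < 10; i++)
    {
        cudaStream_t s;
        CHECK(cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking));
        parent_kernel<<<10, 10, 0, s>>>();
        CHECK(cudaGetLastError());     // catch launch-configuration errors
        CHECK(cudaStreamDestroy(s));   // deferred until the parent completes
    }
    CHECK(cudaDeviceSynchronize());    // host waits for all parents
    return 0;
}
```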