Let's take the following code, where there is a parent and a child kernel. From the parent kernel we wish to launch threadIdx.x
child kernels in different streams to maximize parallel throughput. We then wait for those children with cudaDeviceSynchronize(),
as the parent kernel needs to see the changes made to global memory.
Now let's say we also wish to launch n
parent kernels in streams, and between each set of n
parent kernels we wish to run in parallel, we must also wait for results using cudaDeviceSynchronize().
How would this behave?
From this official introduction to Dynamic Parallelism by NVIDIA, I would think that parent kernel[0]
would only wait for the streams started within it. Is this correct? If not, what happens?
NOTE: I am aware that only so many kernels can run concurrently (32 in my case), but this is more about maximizing occupancy.
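As a side note, since occupancy is the goal: the device runtime has configurable limits that constrain this launch pattern, and they can be raised from the host before the first launch. This is only a sketch; the limit names are the real CUDA device-runtime limits, but the values are placeholders I chose for illustration:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    // Buffer for device-side launches that have not yet begun execution.
    // If the nested launches exceed the default, device-side kernel
    // launches start failing, so it may need to be raised. The value
    // here is an illustrative guess, not a recommendation.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

    // Maximum nesting depth at which a device-side cudaDeviceSynchronize()
    // is allowed; relevant here because parent_kernel synchronizes on
    // its children.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 2);

    return 0;
}
```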
EDIT: a little code sample:
__global__ void child_kernel(void) { }

__global__ void parent_kernel(void)
{
    if (blockIdx.x == 0)
    {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        child_kernel<<<1, 10, 0, s>>>();
        cudaStreamDestroy(s);
    }
    cudaDeviceSynchronize();
}
for (int i = 0; i < 10; i++)
{
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    parent_kernel<<<10, 10, 0, s>>>();
    cudaStreamDestroy(s);
}
cudaDeviceSynchronize();
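For anyone who wants to reproduce this, here is a self-contained version of the sample above with launch-error checking added. The CHECK macro is my own addition for illustration, not part of the original code; dynamic parallelism requires compiling with relocatable device code on a device of compute capability 3.5 or higher (e.g. nvcc -rdc=true -arch=sm_35):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Host-side error check, added for illustration only.
#define CHECK(call) do {                                        \
    cudaError_t err = (call);                                   \
    if (err != cudaSuccess)                                     \
        fprintf(stderr, "CUDA error '%s' at line %d\n",         \
                cudaGetErrorString(err), __LINE__);             \
} while (0)

__global__ void child_kernel(void) { }

__global__ void parent_kernel(void)
{
    if (blockIdx.x == 0)
    {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        child_kernel<<<1, 10, 0, s>>>();
        cudaStreamDestroy(s);  // deferred until the child completes
    }
    // Wait for the child kernels launched above before returning.
    cudaDeviceSynchronize();
}

int main(void)
{
    for (int i = 0; i < 10; i++)
    {
        cudaStream_t s;
        CHECK(cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking));
        parent_kernel<<<10, 10, 0, s>>>();
        CHECK(cudaGetLastError());     // catch launch-configuration errors
        CHECK(cudaStreamDestroy(s));   // deferred until the parent completes
    }
    CHECK(cudaDeviceSynchronize());    // host waits for all parents
    return 0;
}
```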