I'm trying to learn how to use CUDA Dynamic Parallelism.
I have a simple CUDA kernel that creates some work, then launches new kernels to perform that work. Let's say I launch the parent kernel with only 1 block of 1 thread, like so:
int nItems = 100;
parentKernel<<<1,1>>>(nItems);
Now, inside my parent kernel, I create the work, then launch a sub-kernel, like this:
__global__ void parentKernel(int nItems)
{
    // create some work
    // invoke the child kernel from device code
    childKernel<<<2, 256>>>();
}
Notice that the child kernel was launched with more blocks and threads (2 blocks of 256 threads, i.e. 512 threads total) than the parent kernel was given (1 block of 1 thread).
Will the child kernel actually run 512 threads in parallel? Or does the parent kernel have to divvy up its own threads among its children?
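For reference, here is a minimal self-contained sketch of the setup I'm describing (the body of childKernel is just a placeholder I made up; as I understand it, device-side launches need nvcc with relocatable device code, e.g. nvcc -rdc=true -arch=sm_35 dp.cu, and a GPU of compute capability 3.5 or newer):

```cuda
#include <cstdio>

__global__ void childKernel()
{
    // Each of the 2 * 256 = 512 child threads computes its own global index.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx == 0)
        printf("child grid is running\n");
}

__global__ void parentKernel(int nItems)
{
    // A single parent thread launches a child grid whose dimensions
    // are specified independently of the parent's own launch config.
    childKernel<<<2, 256>>>();
}

int main()
{
    int nItems = 100;
    parentKernel<<<1, 1>>>(nItems);
    cudaDeviceSynchronize();  // wait for the parent grid (and its children) to finish
    return 0;
}
```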