When a kernel is launched from the host, its blocks execute in warps of 32 threads. Is the same true for child kernels launched via dynamic parallelism? My guess would be yes, but I haven't seen it stated in the docs.
The larger question, of course, is whether dynamic parallelism is worth it in a case like this:
__global__ void kernel(const int *vec, float *outvec) {
    int idx = threadIdx.x;
    float random_sum = 0;
    for (int j = 0; j < vec[idx]; j++) {
        random_sum += threadsafe_rand_uniform();
    }
    outvec[idx] = random_sum;
}
OK, this example is kind of contrived. The point, though, is that when the loop length varies from thread to thread, it's tempting to parallelize the loop dynamically. However, if the warp size is still 32, child launches whose thread counts aren't multiples of 32 will leave a lot of warp lanes idle. In this particular example, you may want to sort the data first and then dispatch the indexes that parallelize well in one kernel and the poorly shaped indexes in a different one.