I have the following code fragment and am experimenting with features of the new Kepler architecture. The kernel is launched repeatedly in a host-side loop with a fixed NUM_ITERATIONS. Would moving the loop into a parent kernel help? That is, is the kernel launch overhead lower when the kernel is launched from the GPU than when it is launched from the CPU?
Would it be possible to use Dynamic Parallelism to improve the performance of the loop below? If so, could you suggest a comparable use case for Dynamic Parallelism that would help me implement it in my own program?
for (i = 0; i < NUM_ITERATIONS; i++)
{
    kernelGPU<<<gridSize, blkSize>>>(
        d_a,
        d_b,
        d_c,
        d_d,
        d_e,
        R,
        V,
        N
    );
}
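For reference, here is a minimal sketch of the parent-kernel variant I have in mind. It is not my real code: `childKernel` is a hypothetical stand-in for `kernelGPU` (it only adds 1 to each element and takes two arguments instead of eight), `parentKernel` is a name I made up, and the array size and iteration count are arbitrary. I am assuming a device of compute capability 3.5 or higher and a build with relocatable device code, something like `nvcc -arch=sm_35 -rdc=true sketch.cu -lcudadevrt`. As far as I understand, child launches issued by the same thread into the default stream run in order, so the loop should keep its sequential semantics without any device-side synchronization.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_ITERATIONS 100  // arbitrary value chosen for this sketch

// Hypothetical stand-in for kernelGPU: adds 1 to every element of d_a.
__global__ void childKernel(float *d_a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        d_a[idx] += 1.0f;
}

// Hypothetical parent kernel: one thread issues every child launch.
// Launches from the same thread into the default stream execute in order,
// so iteration i + 1 starts only after iteration i has finished.
__global__ void parentKernel(float *d_a, int N, int gridSize, int blkSize)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int i = 0; i < NUM_ITERATIONS; i++) {
            childKernel<<<gridSize, blkSize>>>(d_a, N);
        }
    }
}

int main()
{
    const int N = 1 << 20;                              // assumed problem size
    const int blkSize = 256;
    const int gridSize = (N + blkSize - 1) / blkSize;   // round up

    float *d_a = NULL;
    if (cudaMalloc(&d_a, N * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_a, 0, N * sizeof(float));

    // A single host-side launch replaces the host-side loop.
    parentKernel<<<1, 1>>>(d_a, N, gridSize, blkSize);

    // The parent grid completes only after all of its children have completed.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(d_a);
        return 1;
    }

    // Copy back one element as a quick sanity check.
    float first = 0.0f;
    cudaMemcpy(&first, d_a, sizeof(float), cudaMemcpyDeviceToHost);
    printf("d_a[0] = %f\n", first);

    cudaFree(d_a);
    return 0;
}
```

Timing this against the host-side loop (for example with `cudaEvent_t` around each version) is how I would expect to measure whether the device-side launches are actually cheaper for my case.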