
I have the following code fragment and am experimenting with features of the new Kepler architecture. The kernel is called several times in a loop with a fixed NUM_ITERATIONS. Do you think moving the loop into a parent kernel would help, i.e., is the kernel launch overhead lower when the kernel is invoked from the GPU rather than from the CPU?

Would it be possible to use Dynamic Parallelism to increase performance of the algorithm below? If so, could you suggest a similar use case for dynamic parallelism that would help me implement it in my own program?

for (i = 0; i < NUM_ITERATIONS; i++)
{
    kernelGPU<<<gridSize, blkSize>>>(
        d_a,
        d_b,
        d_c,
        d_d,
        d_e,
        R,
        V,
        N
    );
}
  • I tested the same kind of problem with a Monte Carlo algorithm and it didn't change anything. If you are not moving data, i.e. copying from host to device or device to host, it won't make any difference. You can try putting your loop in a parent kernel, but it won't change anything. Furthermore, if your kernel takes some time, at least more than a few milliseconds, you won't gain performance. – user2076694 Apr 10 '14 at 22:14
  • Yup, you are correct. I actually implemented this by putting the loop in a parent kernel and using DP, but the performance became much worse (around 50% slower). – Kanishk Kanoria Apr 12 '14 at 12:57
  • I guess that shows that DP is not suited to such a structure, and that the kernel launch overhead exists irrespective of whether the kernel is invoked from the host or from the device. – Kanishk Kanoria Apr 12 '14 at 12:58

1 Answer


I actually implemented this by putting the loop in a parent kernel and using dynamic parallelism (DP), but the performance became much worse (around 50% slower).
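For anyone who wants to repeat the comparison, here is a minimal sketch of the two variants, not the actual code from the question. The kernel `childKernel` is a hypothetical stand-in for `kernelGPU` (whose argument types are not shown above), and the values of `NUM_ITERATIONS`, the array size, and the block size are assumptions picked only for illustration. It times the host-side loop and the parent-kernel loop with CUDA events; the numbers will depend on your device, toolkit, and how long the real kernel runs.

```cuda
// Assumed build line for a Kepler-era toolkit (dynamic parallelism needs
// compute capability 3.5+, relocatable device code, and the device runtime):
//   nvcc -arch=sm_35 -rdc=true dp_loop.cu -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_ITERATIONS 100  // assumed value; the question does not give one

// Hypothetical stand-in for kernelGPU: adds 1 to every element.
__global__ void childKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] += 1.0f;
}

// Parent kernel: a single thread issues every child launch.
// Launches made by one thread into the same stream execute in issue order,
// so each iteration sees the results of the previous one.
__global__ void parentKernel(float *data, int n, int gridSize, int blkSize)
{
    for (int i = 0; i < NUM_ITERATIONS; i++)
        childKernel<<<gridSize, blkSize>>>(data, n);
}

int main()
{
    const int n = 1 << 20;                             // assumed problem size
    const int blkSize = 256;                           // assumed block size
    const int gridSize = (n + blkSize - 1) / blkSize;  // enough blocks to cover n

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msHost = 0.0f, msDevice = 0.0f;

    // Variant 1: the loop runs on the host, as in the question.
    cudaEventRecord(start);
    for (int i = 0; i < NUM_ITERATIONS; i++)
        childKernel<<<gridSize, blkSize>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msHost, start, stop);

    // Variant 2: the loop runs inside a parent kernel.
    // The parent grid only completes once all of its child grids have finished,
    // so the stop event also covers the child launches.
    cudaEventRecord(start);
    parentKernel<<<1, 1>>>(d_data, n, gridSize, blkSize);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msDevice, start, stop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    printf("host loop:   %.3f ms\n", msHost);
    printf("device loop: %.3f ms\n", msDevice);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

One thing to watch if you scale this up: the device runtime has a limit on pending child launches (`cudaLimitDevRuntimePendingLaunchCount`, adjustable with `cudaDeviceSetLimit`), so a much larger `NUM_ITERATIONS` may need that limit raised before the parent kernel is launched.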