
Having finally gotten Dynamic Parallelism up and running, I'm now trying to implement my model with it. It took me a while to figure out that some strange output resulted from needing to call cudaDeviceSynchronize() to make the parent kernel wait for the child kernels to finish.

It seems there is something wrong with the kernel I defined as arrAdd. Here's a table of temp's contents before and after each child kernel launch in the k2 parent kernel.

Initially    : k1   = { -1    0    0    0    0   }
Post arrInit : temp = {  .25  .25  .25  .25  .25 }
Post arrMult : temp = { -.25  0    0    0    0   }
Post arrAdd  : temp = { -8    0    0    0    0   }
Expected     : temp = { -.50  0    0    0    0   }


__global__ void k2(double* concs, int* maxlength, double* k1s, double* k2s, double * temp, double* tempsum)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    double a21 = .25;
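    // note: every thread of this grid executes the child launches below,
    // so each parent thread issues its own copies of the child kernels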

    arrInit<<< 1, *maxlength >>>(temp, a21);                //temp = a21
    cudaDeviceSynchronize();
    arrMult<<< 1, *maxlength >>>(k1s, temp, temp);          //temp = a21*k1
    cudaDeviceSynchronize();
    arrAdd<<< 1, *maxlength >>>(temp, temp, temp);          //temp = 2*a21*k1
    cudaDeviceSynchronize();
}

__global__ void arrAdd(double* a, double* b, double* c)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[idx]=a[idx]+b[idx];
}
__global__ void arrMult(double* a, double* b, double* c)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[idx]=a[idx]*b[idx];
}
__global__ void arrInit(double* a, double b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx]=b;
}

1 Answer

You probably don't need to synchronize with the parent kernel. Child kernels execute in the order the parent kernel launches them, and the end of the parent kernel is an implicit synchronization point with the last child kernel.
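
For illustration, here is a minimal sketch of that guarantee (stepA, stepB, and parent are hypothetical names, not from the question): child launches issued by the same parent thread into the default stream are serialized, and the parent grid does not finish until all of its children have finished.

__global__ void stepA(double* d) { d[threadIdx.x] *= 2.0; }
__global__ void stepB(double* d) { d[threadIdx.x] += 1.0; }

__global__ void parent(double* d, int n)
{
    // Launches from the same thread into the same (default) stream are
    // serialized: stepB will not start until stepA has completed.
    stepA<<< 1, n >>>(d);
    stepB<<< 1, n >>>(d);
    // No device-side cudaDeviceSynchronize() is needed here as long as the
    // parent itself never reads d after the launches; the parent grid only
    // completes once both children have completed.
}

// Host side: this one synchronization waits for the parent AND its children.
// parent<<< 1, 1 >>>(d_data, n);
// cudaDeviceSynchronize();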

When you use dynamic parallelism, be careful about these items:

  1. The maximum nesting depth is 24 (for CC 3.5).

  2. The number of dynamic kernels pending for launch at the same time is limited (the default is 2048 on CC 3.5), but it can be increased.

  3. Keep the parent kernel busy after a child kernel call; otherwise there is a good chance you are wasting resources.

I guess your strange wrong results originate from the second item above. When you hit the limit, some of the dynamic kernels simply don't run, and if you don't check for errors you won't notice, because the error-reporting mechanism is per thread.
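
As a sketch of such a per-thread check inside k2 (the device runtime exposes the same error-query API as the host; the printf reporting here is just illustrative):

arrInit<<< 1, *maxlength >>>(temp, a21);
cudaError_t e = cudaGetLastError();   // device-side check of the launch status
if (e != cudaSuccess)
    printf("thread %d: child launch failed: %s\n", idx, cudaGetErrorString(e));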

You can increase this limit by calling cudaDeviceSetLimit() with cudaLimitDevRuntimePendingLaunchCount as the limit. But the more you specify, the more global memory you consume. Have a look at section C.4.3.1.3 of the documentation here.
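
For example, a host-side sketch (32768 is an arbitrary illustrative value):

// Must be set before the parent kernel is launched; the launch buffer
// is carved out of global memory, so larger values cost more space.
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 32768);

size_t pending = 0;
cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount);
printf("pending launch count: %zu\n", pending);   // confirm the new value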

Farzad
  • Hmmm, I appreciate the answer, but I kind of doubt that I'm violating rule #2. My parent kernel calls 5 child kernels at a time, so I should only have about 26 kernels active in this example. This is very informative however, and these hints will hopefully keep me from falling into a pitfall in the future. Additionally, I definitely do need to synchronize after the child kernel! Without the sync, I get the wrong answer. Strange, because I thought there was a sync after child kernels like you say. – Hair of Slytherin Oct 22 '13 at 23:43
  • Maybe you misunderstood. Parent and child run asynchronously with respect to each other. When you call a child kernel and the call returns, there's no guarantee the child kernel has even started, let alone finished. But when the parent kernel returns, it's guaranteed that both parent and child kernels are done (if the above conditions are respected). Another point: when you launch multiple child kernels on the default stream (as you did), they queue up in the order the parent called them, so they run in that same order. – Farzad Oct 23 '13 at 01:05
  • One more thing: each thread in the parent kernel calls the child kernels. As a result, in the above code you can have up to parent_grid_size * parent_block_size * 3 dynamic kernels (each a grid of maxlength threads) pending for launch. – Farzad Oct 23 '13 at 01:08
  • When using dynamic parallelism, you can use the same kind of CUDA API (and kernel) error checking in device code that you do in host code. If you're having trouble, it's recommended; it would tell you definitively, for example, whether you were hitting issue #2 that @Farzad describes. You can also use `cuda-memcheck`, which may shed light on what is going on. Since you haven't provided a complete reproducer, or even shown the code you use to invoke the parent kernel, nobody can tell whether you're hitting issue #2. Don't forget that each parent kernel *thread* is calling the child kernels. Upvoting answer. – Robert Crovella Oct 23 '13 at 02:12