How do I wait for child kernels to finish in a parent kernel before executing the rest of the parent kernel in CUDA dynamic parallelism?

Question

So I need the runParatron children to fully finish before the next iteration of the for loop happens. Based on the results I am getting, I'm pretty sure that's not happening. For example, I have a print statement in runParatron that executes AFTER the first "[" is printed outside the for loop. I tried to run cudaDeviceSynchronize, but it wouldn't compile stating that host code can't be executed on device code, and that cudaDeviceSynchronize is undefined in device code. Is there any way to wait until the children kernels are done for this? I see other posts, examples, and tutorials using cudaDeviceSynchronize within kernels, so perhaps I am missing something basic? Help would be thoroughly appreciated.

__global__ void runMLP(double* x, double* outputs, double* weights, activation_function* A_Fs, int*     CIL, int layers, int bias, int* WLO, int* OLO) {

    if (CIL[0] > 511) {
        copyElements << <CIL[0] / 32, 32 >> > (outputs, x, CIL[0]);
        //I WOULD ALSO LIKE TO WAIT HERE
    }
    else
        for (int i = 0;i < CIL[0];i++) {
            outputs[i] = x[i];
        }

    for (int i = 1;i < layers;i++) {
        printf("----------------------Layer %d :: InputSize %d :: Layer weight offset %d ::     Layer output offset %d----------------------\n", i, CIL[i-1], WLO[i-1], OLO[i]);
        runParatron << < (CIL[i] / 32) + 1, 32 >> > (outputs + OLO[i - 1], outputs +     OLO[i], weights + WLO[i - 1], A_Fs[i], CIL[i - 1], CIL[i], bias);
        //cudaDeviceSynchronize(); //THIS IS WHERE I NEED TO WAIT UNTIL NEXT ITERATION
    }
    if (A_Fs[layers - 1] == SOFTMAX) {
        double* temp = outputs + OLO[layers - 1];
        printf("[");
        for (int i = 0;i < CIL[layers-1];i++) {
            printf("% d, ", temp[i]);
        }
        printf("]\n");
        double denom = 0;
        for (int i = 0;i < CIL[layers - 1];i++) {
            denom += temp[i];
        }
        if (denom < DBL_MIN)
            denom = DBL_MIN;
        for (int i = 0;i < CIL[layers - 1];i++) {
            temp[i] /= denom;
        }
    }
}

For example, here is the output where the "[" comes before the child kernel output:

//All Cell: starting lines are produced from child kernel
[Cell: 0 :: weightOffset 0 :: AF 2 //As you can see, there is the "[" here when it should be
Cell: 1 :: weightOffset 6 :: AF 2
Cell: 2 :: weightOffset 12 :: AF 2
Cell: 3 :: weightOffset 18 :: AF 2
-502657059,  2118981138,  1645236453, ] //Down here!

With respect to `cudaDeviceSynchronize()` (in device code), see [here](https://forums.developer.nvidia.com/t/cudadevicesynchronize-from-device-code-is-deprecated/215900) and [here](https://forums.developer.nvidia.com/t/cant-run-my-program-on-rtx-4080/237457). — Robert Crovella, Jan 07 '23 at 22:16
Thanks for the swift response, and ahahaha just my luck. Would you be able to explain why it was deprecated? Or point me to a source explaining it. I've got it somewhat figured out with atomic counting for now, but I would love to learn more about the synchronization interactions with dynamic parallelism. — yugi957, Jan 07 '23 at 23:04
@yugi957 According to Nvidia performance wasn't good enough. The new way to go seems to be to do a tail launch when you need the results of your child kernels. Naturally this has the disadvantage that one might need to load data from global memory again instead of shared memory. But I guess waiting parent kernels blocking compute resources was part of the problem. — paleonix, Jan 07 '23 at 23:46
They talk about these changes in the [CUDA 12 New Features and Beyond](https://youtu.be/_5mnVGOxq50) video at 3:39. — paleonix, Jan 07 '23 at 23:52
Amazing @paleonix, I appreciate it a lot. This video helped me so much. Lots of interesting stuff honestly. But I don't believe tail launch is what I'm looking for, as that consists of the children kernel executing AFTER the parent kernel is finished. I need the parent kernel to WAIT after child kernel launch to continue the parent kernel, although I now realize that may be extremely inefficient, and have an idea to refactor my code to not need this logic. — yugi957, Jan 09 '23 at 00:32

score -2 · Answer 1 · answered Jan 07 '23 at 22:55

So I added an atomic counter and incremented it by one at the end of each child kernel. Then I put a while loop after the child kernel call checking to see if the counter had reached the amount of calls I wanted to finish yet. This fixed it. Let me know if anyone needs code for or clarification.

How do I wait for child kernels to finish in a parent kernel before executing the rest of the parent kernel in CUDA dynamic parallelism?

1 Answers1