CUDA Dynamic Parallelism Dereferencing Global Memory

Question

To test out dynamic parallelism, I wrote a simple program and compiled it for a GTX 1080 with the following commands.

nvcc -arch=sm_35   -dc dynamic_test.cu -o dynamic_test.o
nvcc -arch=sm_35   dynamic_test.o  -lcudadevrt -o dynamic_test

However, the output is not as expected. It seems like the pointers passed to the child kernel are de-referenced.

#include <stdlib.h>
#include <stdio.h>
#include <cublas_v2.h>
#include <cuda_runtime_api.h>

__global__ void child(int *a, int *b, int *c){

        int tid = threadIdx.x;
        c[tid] = a[tid] + b[tid];
}


__global__ void Parent(int *a, int *b, int *c){

        int tid = threadIdx.x;
        const int n = 10;

        a[tid] = tid;
        b[tid] = 2*tid;
        c[tid] = -10;

        __syncthreads();
        cudaDeviceSynchronize();
        if (tid == 1){
          child<<<1,n>>>(a,b,c);
          cudaDeviceSynchronize();
        }
}


int main(){

        int *d_a, *d_b, *d_c;
        const int n = 10;
        int a[n],b[n],c[n],i;

        cudaMalloc((void**)&d_a,n*sizeof(int));
        cudaMalloc((void**)&d_b,n*sizeof(int));
        cudaMalloc((void**)&d_c,n*sizeof(int));

        Parent << < 1, n >>> (d_a,d_b,d_c);
        cudaDeviceSynchronize();

        cudaMemcpy(a,d_a,n*sizeof(int),cudaMemcpyDeviceToHost);
        cudaMemcpy(b,d_b,n*sizeof(int),cudaMemcpyDeviceToHost);
        cudaMemcpy(c,d_c,n*sizeof(int),cudaMemcpyDeviceToHost);

        for(i=0; i<n; i++){
           printf("a[%d] = %d\n",i,a[i]);
        }
        for(i=0; i<n; i++){
           printf("b[%d] = %d\n",i,b[i]);
        }
        for(i=0; i<n; i++){
           printf("c[%d] = %d\n",i,c[i]);
        }

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

        return 0;
}

Here's the output:

a[0] = 1
a[1] = 0
a[2] = 4208446
a[3] = 0
a[4] = 0
a[5] = 0
a[6] = 0
a[7] = 0
a[8] = 0
a[9] = 0
b[0] = 3
b[1] = 0
b[2] = 4204323
b[3] = 0
b[4] = 4205312
b[5] = 0
b[6] = 4732449
b[7] = 0
b[8] = 4205680
b[9] = 0
c[0] = 194906208
c[1] = 32767
c[2] = 4204143
c[3] = 0
c[4] = 4205616
c[5] = 0
c[6] = 4732608
c[7] = 0
c[8] = 4231155
c[9] = 0

From my reading of the programming guide, I should be able to pass pointers to global memory to child kernels without any dereferencing problems. I am not sure why the output is incorrect. My ultimate goal is to use the cuBLAS library from within kernels. Any suggestions in that direction would also be helpful.

Is this also your first CUDA program? Do you know that your CUDA installation is working correctly? Because I have compiled and run your code and it works as expected. What happens if you run it using cuda-memcheck? – talonmies, Oct 11 '16 at 06:18
Thank you for pointing me in that direction. It is now working with CUDA 8.0. – JYC, Oct 11 '16 at 15:28
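
The posted program never checks a return code, so a failed launch or copy would go unnoticed. The printed values do look like uninitialized host memory, which would be consistent with that, although the thread does not confirm it. Below is a minimal sketch of the kind of checking that would surface such a failure. The helper `ok` and the kernel `fill` are hypothetical names made up for this illustration, and the kernel is deliberately simpler than the one in the question (no child launch, so no -rdc or -lcudadevrt is needed).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): prints a message and
// returns false when a CUDA runtime call did not return cudaSuccess.
static bool ok(cudaError_t err, const char *what) {
    if (err == cudaSuccess) return true;
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}

// Hypothetical stand-in kernel: each thread writes one element.
__global__ void fill(int *c) {
    c[threadIdx.x] = -10;
}

int main() {
    const int n = 10;
    int *d_c = NULL;
    int c[n] = {0};  // zero-initialized so stale stack contents are never printed

    if (!ok(cudaMalloc((void **)&d_c, n * sizeof(int)), "cudaMalloc")) return 1;

    fill<<<1, n>>>(d_c);
    // Launch-time problems (bad configuration, no usable kernel image, ...)
    if (!ok(cudaGetLastError(), "kernel launch")) return 1;
    // Problems raised while the kernel was running
    if (!ok(cudaDeviceSynchronize(), "kernel execution")) return 1;

    if (!ok(cudaMemcpy(c, d_c, n * sizeof(int), cudaMemcpyDeviceToHost),
            "cudaMemcpy")) return 1;

    for (int i = 0; i < n; i++) {
        printf("c[%d] = %d\n", i, c[i]);
    }

    if (!ok(cudaFree(d_c), "cudaFree")) return 1;
    return 0;
}
```

Applying the same pattern to the Parent launch in the question should show whether the kernel ran at all. Running the binary under cuda-memcheck, as suggested in the comment above, reports the same class of errors without changing the code.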

score -2 · Answer 1 · answered Oct 11 '16 at 15:28 by JYC

The problem was solved by switching from cuda 7.5 to cuda 8.0.

When I ran your code, I used CUDA 7.5, so the version was not the problem. – talonmies Oct 11 '16 at 16:03
I am running this code on a cluster because I have no access to those graphics cards otherwise. It could be that the installation for 7.5 is incorrect. However, I am not even sure if that's the case because other cuda programs without dynamic parallelism compile and run fine. – JYC Oct 12 '16 at 06:24
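
Since the comments disagree on whether the toolkit version matters, it may help to print what the cluster node actually reports. The sketch below queries the runtime version, the driver version, and the compute capability of device 0. The helper `supports_dynamic_parallelism` is a hypothetical name for this illustration; it encodes only the documented requirement of compute capability 3.5 or higher. For context, the GTX 1080 is a compute capability 6.1 device, and the sm_61 target was introduced with CUDA 8.0. Whether that explains the behavior seen here is an assumption, not something the thread establishes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: dynamic parallelism requires compute capability >= 3.5.
static bool supports_dynamic_parallelism(int major, int minor) {
    return major > 3 || (major == 3 && minor >= 5);
}

int main() {
    int runtime_version = 0;
    int driver_version = 0;

    if (cudaRuntimeGetVersion(&runtime_version) != cudaSuccess ||
        cudaDriverGetVersion(&driver_version) != cudaSuccess) {
        fprintf(stderr, "could not query CUDA versions\n");
        return 1;
    }
    // Versions are encoded as 1000 * major + 10 * minor (8000 means 8.0).
    printf("runtime version: %d.%d\n",
           runtime_version / 1000, (runtime_version % 1000) / 10);
    printf("driver supports up to: %d.%d\n",
           driver_version / 1000, (driver_version % 1000) / 10);

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    printf("dynamic parallelism supported: %s\n",
           supports_dynamic_parallelism(prop.major, prop.minor) ? "yes" : "no");
    return 0;
}
```

If the reported compute capability is higher than the -arch value passed to nvcc, the program depends on the driver compiling the embedded PTX at load time, which is one more place where a toolkit and driver mismatch could show up.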
