CUDA dynamic parallelism: invalid global write when using texture memory

Question

I seem to have troubles when a kernel call within a kernel (even recursive call) uses texture memory to get a value.

If the child kernel, say a different one, doesn't use texture memory, everything is fine. If I don't call a kernel within a kernel, the results are the expected ones. As long as I use texture memory which in my case is very useful due to spatial locality and fast filtering, cuda-memcheck returns "Invalid __global__ write of size 4".

I've seen that, in dynamic parallelism in the programming guide, one must be carefull when using texture memory that may result in inconsistent data, but here the child kernel does not even launch.

I've tried __syncthreads() and cudaDeviceSynchronize placed before or after the call to texture memory but nothing.

Are there some already reported cases, am I doing something wrong or it is just that you can't use texture memory that way?

system: gtx titan black (sm_3.5), CUDA6.0.

EDIT: some example code to illustrate.

Obviously, EField is declared and filled before. HANDLE_ERROR comes from the book.h include from CUDA by examples

Here is a compilable code:

#include "cuda.h"
#include "/common/book.h"

#define DIM 2048

texture<float4, 2, cudaReadModeElementType> texEField;

__device__ int oneChild = 0;


__global__ void test_cdp( float x0, float y0 ){
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int idx = x + y * blockDim.x * gridDim.x;

    printf("Propa started from thread %d\n", idx);      
    float4 E = tex2D( texEField, x0, y0 );

    printf("E field %f -- %f\n", E.z, E.w);     
    if( oneChild < 1 ){
        test_cdp<<<1, 1>>>(x0, y0);
        oneChild++;
    }
}

int main( void ){   

    //Start of texture allocation

    float4 *EField = new float4 [DIM*DIM];
    for( int u = 0; u < DIM*DIM; u++ ){
        EField[u].x = 1.0f;
        EField[u].y = 1.0f;
        EField[u].z = 1.0f;
        EField[u].w = 1.0f;
    }   


    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();

    float4 *dev_EField;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_EField, DIM * DIM * sizeof(float4) ) );

    HANDLE_ERROR( cudaMemcpy( dev_EField, EField, DIM * DIM * sizeof(float4), cudaMemcpyHostToDevice ) );

    HANDLE_ERROR( cudaBindTexture2D( NULL, texEField, dev_EField, desc, DIM, DIM, sizeof(float4) * DIM ) );

    texEField.addressMode[0] = cudaAddressModeWrap;
    texEField.addressMode[1] = cudaAddressModeWrap;
    texEField.filterMode = cudaFilterModeLinear;
    texEField.normalized = true;

    test_cdp<<<1, 1>>>(0.5, 0.5);

    HANDLE_ERROR( cudaFree( dev_EField ) );
    HANDLE_ERROR( cudaUnbindTexture( texEField ) );
    return 0;
}

You should add some example code demonstrating this problem to your post. — Ross Ridge, Aug 20 '14 at 16:22
Done. Some of the code is missing which has pretty much nothing to do with the problem. — salvaS, Aug 20 '14 at 19:10
Could you post a complete compilable sample that hits this issue ? From a quick glance, it does seem like your kernel will keep recursively launching itself, since the variable `oneChild` is a local to each kernel invocation. If you mean to only have one child launch globally, can you try making `oneChild` a file global (i.e. declared as `__device__ int oneChild = 0;` outside the kernel) — Vyas, Aug 20 '14 at 21:07
Note that your edited kernel now has a race condition. I would suggest incrementing `oneChild` immediately *before* launching the child kernel. The CUDA execution model does not guarantee that the parent kernel thread will reach the update of `oneChild` before the child kernel tests it. — Robert Crovella, Aug 21 '14 at 16:43
True. Actually, in the "final" code this won't matter because the condition to reach the launch of the child is given by another variable not incremented in the condition itself, but that's a good advice. Might not be the place to ask, but in my code, the excution of a thread takes quite some time (few seconds, which is "normal" in my case) bu then when few child recursive threads are launched, they all have the same `idx`. I thought they were launched in parallel, but here it would look like it's serialized, have I missed something? — salvaS, Aug 22 '14 at 18:24

score 0 · Accepted Answer · edited May 23 '17 at 10:33

0

In the future, please provide a complete, compilable code. SO expects this. As one example of uncertainty, your kernel definition is test_cdp. Your kernel called from host code is test2_cdp. Please don't make others guess at your intentions, or play 20 questions to clarify your code. Post a complete, compilable code, that requires no additions or changes, that demonstrates the issue. This is the reason for the close votes on your question.

I can see 2 problems.

If you were to fix the above issue, this code as written could lead to an endless chain of child kernels being launched. It appears that you may think the oneChild variable is somehow shared between parent and child kernels. It is not. Therefore every launched child kernel will see that oneChild is zero, and it will launch its own child kernel. I don't know where this sequence would end, but its not a sensible use of CDP.
CDP does not support module-scope texture referencing from device-launched kernels. Use texture objects instead.

edited May 23 '17 at 10:33

Community

1
1

answered Aug 20 '14 at 21:23

Robert Crovella

143,785
11
213
257

Thanks Robert. I have edited a compilable code, removed the mistakes. If you compile it, still the same error. Considering your comments: it is not possible (according to the programming guide if I have well understood) that the excecution of the parent kernel finishes before its child, it has to wait. Regarding your last, it seems that's the answer I was looking for. I don't remember seeing this in the guide v.6.0. – salvaS Aug 21 '14 at 07:23
It's even in the 5.5 programming guide. It's in section C.3.1.6.2 in 5.5, 6.0, and 6.5 programming guides. Yes, you are correct about parent/child synchronization, I have updated my answer. – Robert Crovella Aug 21 '14 at 14:58
Thanks. Your last suggestion was the answer. Indeed, present in the guide, I just obliterated it I think... – salvaS Aug 22 '14 at 18:25

CUDA dynamic parallelism: invalid global write when using texture memory

1 Answers1