I seem to have troubles when a kernel call within a kernel (even recursive call) uses texture memory to get a value.
If the child kernel, say a different one, doesn't use texture memory, everything is fine. If I don't call a kernel within a kernel, the results are the expected ones. As long as I use texture memory which in my case is very useful due to spatial locality and fast filtering, cuda-memcheck returns "Invalid __global__ write of size 4".
I've seen that, in dynamic parallelism in the programming guide, one must be carefull when using texture memory that may result in inconsistent data, but here the child kernel does not even launch.
I've tried __syncthreads() and cudaDeviceSynchronize placed before or after the call to texture memory but nothing.
Are there some already reported cases, am I doing something wrong or it is just that you can't use texture memory that way?
system: gtx titan black (sm_3.5), CUDA6.0.
EDIT: some example code to illustrate.
Obviously, EField is declared and filled before. HANDLE_ERROR comes from the book.h include from CUDA by examples
Here is a compilable code:
#include "cuda.h"
#include "/common/book.h"
#define DIM 2048
texture<float4, 2, cudaReadModeElementType> texEField;
__device__ int oneChild = 0;
__global__ void test_cdp( float x0, float y0 ){
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int idx = x + y * blockDim.x * gridDim.x;
printf("Propa started from thread %d\n", idx);
float4 E = tex2D( texEField, x0, y0 );
printf("E field %f -- %f\n", E.z, E.w);
if( oneChild < 1 ){
test_cdp<<<1, 1>>>(x0, y0);
oneChild++;
}
}
int main( void ){
//Start of texture allocation
float4 *EField = new float4 [DIM*DIM];
for( int u = 0; u < DIM*DIM; u++ ){
EField[u].x = 1.0f;
EField[u].y = 1.0f;
EField[u].z = 1.0f;
EField[u].w = 1.0f;
}
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
float4 *dev_EField;
HANDLE_ERROR( cudaMalloc( (void**)&dev_EField, DIM * DIM * sizeof(float4) ) );
HANDLE_ERROR( cudaMemcpy( dev_EField, EField, DIM * DIM * sizeof(float4), cudaMemcpyHostToDevice ) );
HANDLE_ERROR( cudaBindTexture2D( NULL, texEField, dev_EField, desc, DIM, DIM, sizeof(float4) * DIM ) );
texEField.addressMode[0] = cudaAddressModeWrap;
texEField.addressMode[1] = cudaAddressModeWrap;
texEField.filterMode = cudaFilterModeLinear;
texEField.normalized = true;
test_cdp<<<1, 1>>>(0.5, 0.5);
HANDLE_ERROR( cudaFree( dev_EField ) );
HANDLE_ERROR( cudaUnbindTexture( texEField ) );
return 0;
}