I have such a 3D kernel that I currently run on one block:
// The two following variables are set elsewhere in the program.
// I give them possible values here for demonstration purposes.
int N[] = {14, 5, 1};
int L = 2; // N's size - 1
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
int idz = blockIdx.z * blockDim.z + threadIdx.z;
int idxInc = idx + 1; // so as not to waste the threads whose idx = 0
if (idxInc >= 1 && idxInc <= L)
{
    if (idy < N[idxInc])
    {
        if (idz < N[idxInc-1])
        {
            dw[ idxInc ][ idy ][ idz ] = 0;
        }
    }
}
If I launch this kernel on one block whose dimensions are {2, 5, 14}, everything works: that is exactly the number of threads needed in each dimension of the block for the kernel to process the data as defined in the first two lines. Now I don't see how to divide this work among multiple blocks. My brain stalls just trying to find the right number of threads for each dimension over two blocks. Moreover, L may vary (though I might put a restriction on it), and N[1] in particular will vary a lot (it's 5 in this example, but could be 128, 256, or 2048...). So I have to find an algorithm that automatically balances the number of blocks and the number of threads in each of the three dimensions of a block.
I really don't see how to do this, and I feel stupid! I'm beginning to think I should just stop playing with 3 dimensions... Or maybe there is a simple trick I just can't see...
Some help? Thank you!
Edit: here is how I serially check the result...
for (int layer = 1; layer <= L; layer++)
{
    for (int i = 0; i < N[layer]; i++)
    {
        for (int j = 0; j < N[layer-1]; j++)
        {
            printf("%1.0f", dw[ layer ][ i ][ j ]);
        }
        printf("\n");
    }
    printf("\n");
}
Every number displayed should be 0.