As the question states: if I have a 3D grid of blocks, what is the formula to get a globally unique index for one thread?
Let's keep the block itself one-dimensional.
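// unique block index inside a 3D block grid
// (casting to 64 bits before multiplying avoids 32-bit overflow on very large grids)
const unsigned long long int blockId = blockIdx.x //1D
+ (unsigned long long int)blockIdx.y * gridDim.x //2D
+ (unsigned long long int)gridDim.x * gridDim.y * blockIdx.z; //3D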
// global unique thread index, block dimension uses only x-coordinate
const unsigned long long int threadId = blockId * blockDim.x + threadIdx.x;
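As a minimal sketch (assuming the CUDA runtime API; the helper names `linearBlockId`, `globalThreadId`, `writeIds`, and `runWriteIds` are hypothetical), the same formula can be factored into `__host__ __device__` helpers so it can also be checked on the CPU. The 64-bit casts are an addition to guard against 32-bit overflow on very large grids, not part of the formula itself.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Linear block ID inside a 3D grid: x varies fastest, then y, then z.
// Casting to 64 bits *before* multiplying keeps large grids from
// overflowing 32-bit unsigned arithmetic.
__host__ __device__ inline unsigned long long
linearBlockId(uint3 b, dim3 grid)
{
    return (unsigned long long)b.z * grid.x * grid.y
         + (unsigned long long)b.y * grid.x
         + b.x;
}

// Global thread ID for a 3D grid of 1D blocks (only the x block dimension is used).
__host__ __device__ inline unsigned long long
globalThreadId(uint3 b, dim3 grid, unsigned blockDimX, unsigned tx)
{
    assert(tx < blockDimX);  // debug-only sanity check
    return linearBlockId(b, grid) * blockDimX + tx;
}

// Each thread stores its own global ID at that same position.
__global__ void writeIds(unsigned long long* out)
{
    const unsigned long long id =
        globalThreadId(blockIdx, gridDim, blockDim.x, threadIdx.x);
    out[id] = id;
}

// Launches writeIds on `grid` x `threads` and copies the result to `host`,
// which must hold grid.x * grid.y * grid.z * threads elements.
cudaError_t runWriteIds(dim3 grid, unsigned threads, unsigned long long* host)
{
    const size_t n = (size_t)grid.x * grid.y * grid.z * threads;
    unsigned long long* dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, n * sizeof(*dev));
    if (err != cudaSuccess) return err;

    writeIds<<<grid, threads>>>(dev);
    err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev, n * sizeof(*dev), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err;
}
```

For example, `runWriteIds(dim3(4, 3, 2), 64, buf)` with a 1536-element `buf` should leave `buf[i] == i` for every `i` if each thread received a distinct index.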
A bit late to the party, but here is how I usually approach this in a fairly generic way, since it supports any number and size of blocks (even 2D ones):
// Compute the offset in each dimension
// (sizeX, sizeY, sizeZ are the extents of the data, e.g. passed in as kernel arguments)
const size_t offsetX = (size_t)blockDim.x * blockIdx.x + threadIdx.x;
const size_t offsetY = (size_t)blockDim.y * blockIdx.y + threadIdx.y;
const size_t offsetZ = (size_t)blockDim.z * blockIdx.z + threadIdx.z;
// Make sure that you are not actually outside the data bounds
if (offsetX >= sizeX || offsetY >= sizeY || offsetZ >= sizeZ)
    return;
// Compute the linear index assuming X-fastest, then Y, then Z memory ordering
const size_t idx = offsetZ * sizeX * sizeY + offsetY * sizeX + offsetX;
Mind you, I'm not a CUDA ninja.
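The same bounds-checked approach can be wrapped into a complete kernel plus launcher. This is a minimal sketch assuming the CUDA runtime API; `ceilDiv`, `linearIndex3D`, `fillIndices`, `runFillIndices`, and the 8x8x4 block shape are illustrative choices, not part of the answer above. Because the grid size is rounded up with `ceilDiv`, some threads in the edge blocks land outside the data, which is exactly why the early `return` is needed.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Round-up integer division: number of blocks needed to cover n elements.
__host__ __device__ inline unsigned ceilDiv(size_t n, unsigned d)
{
    assert(d > 0);
    return (unsigned)((n + d - 1) / d);
}

// Linear index of element (x, y, z) with X fastest, then Y, then Z.
__host__ __device__ inline size_t
linearIndex3D(size_t x, size_t y, size_t z, size_t sizeX, size_t sizeY)
{
    return z * sizeX * sizeY + y * sizeX + x;
}

// Hypothetical example kernel: every in-range thread writes its own linear index.
__global__ void fillIndices(size_t* out, size_t sizeX, size_t sizeY, size_t sizeZ)
{
    const size_t offsetX = (size_t)blockDim.x * blockIdx.x + threadIdx.x;
    const size_t offsetY = (size_t)blockDim.y * blockIdx.y + threadIdx.y;
    const size_t offsetZ = (size_t)blockDim.z * blockIdx.z + threadIdx.z;

    // Threads in the partially filled edge blocks fall outside the data.
    if (offsetX >= sizeX || offsetY >= sizeY || offsetZ >= sizeZ)
        return;

    const size_t idx = linearIndex3D(offsetX, offsetY, offsetZ, sizeX, sizeY);
    out[idx] = idx;
}

// Covers a sizeX x sizeY x sizeZ volume with 8x8x4 blocks (an arbitrary choice)
// and copies the result to `host`, which must hold sizeX * sizeY * sizeZ elements.
cudaError_t runFillIndices(size_t* host, size_t sizeX, size_t sizeY, size_t sizeZ)
{
    const size_t n = sizeX * sizeY * sizeZ;
    const dim3 block(8, 8, 4);
    const dim3 grid(ceilDiv(sizeX, block.x), ceilDiv(sizeY, block.y),
                    ceilDiv(sizeZ, block.z));

    size_t* dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, n * sizeof(*dev));
    if (err != cudaSuccess) return err;

    fillIndices<<<grid, block>>>(dev, sizeX, sizeY, sizeZ);
    err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev, n * sizeof(*dev), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err;
}
```

For a 10 x 7 x 3 volume this launches a 2 x 1 x 1 grid of 256-thread blocks (512 threads for 210 elements); the surplus threads exit at the bounds check, and afterwards `host[i]` should equal `i`.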
The existing answer by @djmj is good, but a bit of reformatting makes it a little clearer what's going on (at least to my brain, which is new to CUDA):
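// long long (not long) so the result is 64-bit on every platform;
// the casts make the multiplications happen in 64 bits as well
long long blockId = (long long)blockIdx.z * gridDim.x * gridDim.y
                  + (long long)blockIdx.y * gridDim.x
                  + blockIdx.x;
long long threadsPerBlock = blockDim.x;
long long i = blockId * threadsPerBlock + threadIdx.x;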
blockId is the sum of the blocks in the complete z-dimension "slices" (2D grids), plus the blocks in the complete rows of the final (incomplete) slice, plus the blocks in the final (incomplete) row of that (incomplete) slice.
By "complete", I mean the blocks that "precede" the current (x, y, z) block (with respect to the way we're summing them to determine the overall block id).
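To make the slice/row/block decomposition concrete, here is a worked example with made-up sizes: gridDim = (4, 3, 2), the block at blockIdx = (1, 2, 1), blockDim.x = 64 and threadIdx.x = 5.

```latex
$$
\begin{aligned}
\text{blockId} &= \underbrace{1 \cdot (4 \cdot 3)}_{\text{complete } z\text{-slices}}
              + \underbrace{2 \cdot 4}_{\text{complete rows in the last slice}}
              + \underbrace{1}_{\text{blocks before it in its row}}
              = 12 + 8 + 1 = 21 \\
i &= \text{blockId} \cdot \text{blockDim.x} + \text{threadIdx.x} = 21 \cdot 64 + 5 = 1349
\end{aligned}
$$
```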