I'm using CUDA 9 on a Pascal architecture, trying to implement a reasonable block reduction using warp shuffle intrinsics plus a shared memory intermediate step.
Examples I've seen on the web:
The first of those links illustrate the shuffle intrinsics with _sync, and how to use __ballot_sync()
, but only goes as far as a single warp reduction.
The second of those links is a Kepler-era article that doesn't use the newer _sync but does illustrate a full block level reduction by staging individual warp reductions into shared memory, then reading those values back into warp 0 and doing one more warp reduction to achieve a block reduction.
My problem is different from these and other examples I've seen on the web is that my reduction operator isn't a simple sum, and my "N" usually won't be a nice power of 2. From my debugging efforts, it seems that when an active thread (included in the mask provided by __ballot_sync()
tries to obtain a value from an inactive thread (not included in the mask), it retrieves a "0". A "0" would work fine regardless for a sum reduction, but not for a min reduction. ).
take the following code excerpt:
__device__ void warpReduceMin(uint32_t &val, uint32_t mask)
{
for (int offset=16; offset>0; offset /= 2)
{
uint32_t tmp;
tmp = __shfl_down_sync(mask, val, offset);
val = (tmp<val) ? tmp : val;
}
}
__global__ void my_kernel(uint32_t *d_data, uint32_t N)
{
__shared__ uint32_t shmem[32];
if (threadIdx.x >= N) return;
uint32_t mask = __ballot_sync(0xFFFFFFFF, threadIdx.x < blockDim.x)
uint32_t val = d_data[threadIdx.x];
uint32_t warp_id = threadIdx.x / warpSize;
uint32_t lane_id = threadIdx.x % warpSize;
warpReduceMin(val, mask);
// val is erroneously set to "0" for the active threads in last warp
if (lane_id == 0)
shmem[warp_id] = val;
__syncthreads();
val = shmem[lane_id];
mask = __ballot_sync(0xFFFFFFFF, threadIdx.x < (blockDim.x+warpSize-1)/warpSize );
if (warp_id == 0)
warpReduceMin( val, mask );
// do something with result...
}
If I call the kernel with a block size of 1024, and I have 1024 elements in my data (N=1000)...I get the expected answer. But if I call the kernel with a block size of 1024, with N=1000, then I can see through printf debugging that my last warp of incomplete data (warp_id == 31; elements = 992:999), that the initial offset of 16 is pulling a "0" from a thread which isn't even involved in the warp.
So I'm not quite sure where my error is.