
I'm using CUDA 9 on a Pascal architecture, trying to implement a reasonable block reduction using warp shuffle intrinsics plus a shared memory intermediate step.

Examples I've seen on the web:

The first of those links illustrates the shuffle intrinsics with _sync and how to use __ballot_sync(), but only goes as far as a single warp reduction.

The second of those links is a Kepler-era article that doesn't use the newer _sync variants, but does illustrate a full block-level reduction by staging individual warp reductions into shared memory, then reading those values back into warp 0 and doing one more warp reduction to achieve a block reduction.

Where my problem differs from these and other examples I've seen on the web is that my reduction operator isn't a simple sum, and my "N" usually won't be a nice power of 2. From my debugging efforts, it seems that when an active thread (one included in the mask provided by __ballot_sync()) tries to obtain a value from an inactive thread (one not included in the mask), it retrieves a "0". A "0" would work fine for a sum reduction, but not for a min reduction.

Take the following code excerpt:

__device__ void warpReduceMin(uint32_t &val, uint32_t mask)
{
    for (int offset = 16; offset > 0; offset /= 2)
    {
        uint32_t tmp;
        tmp = __shfl_down_sync(mask, val, offset);
        val = (tmp < val) ? tmp : val;
    }
}

__global__ void my_kernel(uint32_t *d_data, uint32_t N)
{
    __shared__ uint32_t shmem[32];

    if (threadIdx.x >= N) return;

    uint32_t mask = __ballot_sync(0xFFFFFFFF, threadIdx.x < blockDim.x);
    uint32_t val = d_data[threadIdx.x];
    uint32_t warp_id = threadIdx.x / warpSize;
    uint32_t lane_id = threadIdx.x % warpSize;

    warpReduceMin(val, mask);
    // val is erroneously set to "0" for the active threads in last warp
    if (lane_id == 0)
        shmem[warp_id] = val;
    __syncthreads();
    val = shmem[lane_id];
    mask = __ballot_sync(0xFFFFFFFF, threadIdx.x < (blockDim.x+warpSize-1)/warpSize );
    if (warp_id == 0)
        warpReduceMin( val, mask );

    // do something with result...

}

If I call the kernel with a block size of 1024 and I have 1024 elements in my data (N=1024), I get the expected answer. But if I call the kernel with a block size of 1024 and N=1000, then I can see through printf debugging that in my last warp of incomplete data (warp_id == 31; elements 992:999), the initial offset of 16 is pulling a "0" from a thread which isn't even involved in the warp.

So I'm not quite sure where my error is.

  • So are you saying the fix would be to remove the early return, and then provide a large dummy value to "val" for cases where threadIdx.x >= N? – wrjohns Jul 23 '19 at 20:21
  • I would certainly expect that to fix it. I think other fixes are also possible. If you guarantee all warps are fully loaded that way, I think you can get rid of some of your code. – Robert Crovella Jul 23 '19 at 20:27
  • shuffling down from a lane that is inactive produces undefined behavior. The mask parameter doesn't have any bearing on this. In the "last warp" for the N=1000 case, you have 8 active threads/lanes, and the rest are inactive. In the first shuffle down operation, you are asking warp lane zero to pick up the value from warp lane 16. Warp lane 16 is not participating. So the result you get is undefined. Again, mask has nothing to do with this. Mask is a parameter that forces reconvergence, but it cannot have any effect on threads that are prevented from participating due to conditional code. – Robert Crovella Jul 23 '19 at 20:51
  • "A "0" would work fine regardless for a sum reduction, but not for a min reduction." Exactly my problem, indeed this effect was hiding bugs. How does this question not have any upvotes? – Tyson Hilmer Dec 10 '21 at 10:08

1 Answer


There are probably other things I could quibble about in this code, but the proximal issue that is causing the unexpected result is that you are doing an undefined warp shuffle here:

tmp = __shfl_down_sync(mask, val, offset);

In the case of the last warp (i.e. warp ID 31) in the "last" threadblock, i.e. the one where this line is preventing some threads from participating:

if (threadIdx.x >= N) return;

you have the following scenario (let's consider a single threadblock of 1024 threads, with N=1000): In the last warp, there are 8 active threads, whose threadIdx.x values are from 992 to 999, inclusive. The other threads in that warp would be "removed" by the if statement. So we have 8 active threads, and in the first pass of the warp shuffle, offset is 16. So lane 0 (thread 992) is requesting the value from lane 16 (thread 1008). But thread 1008 is not participating. That combination (either the source or destination lane not participating in the warp shuffle) is specifically identified in the programming guide as producing undefined results:

Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined.

Specifying the mask parameter to any particular value does not change this behavior/requirement. There is a careful description of the mask parameter here. In fact your mask is 0xFF (selecting 8 threads) entering into this problematic case, so it is "consistent" with the fact that you have 8 active threads, but doesn't address the warp-shuffle inactive source lane problem.
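As noted in the comments, other fixes are possible. One possible sketch (a hypothetical warpReduceMinMasked, untested here, and assuming the active lanes are contiguous starting at lane 0, which is exactly what the early return on threadIdx.x >= N produces) keeps the ballot mask but folds in a shuffled value only when the source lane's bit is actually set in the mask, so an undefined value never reaches the min:

__device__ void warpReduceMinMasked(uint32_t &val, uint32_t mask)
{
    uint32_t lane = threadIdx.x % 32;
    for (int offset = 16; offset > 0; offset /= 2)
    {
        uint32_t tmp = __shfl_down_sync(mask, val, offset);
        // the value pulled from lane+offset is undefined if that lane is
        // not in the mask, so only fold it in when its mask bit is set
        if ((lane + offset) < 32 && (mask & (1u << (lane + offset))))
            val = (tmp < val) ? tmp : val;
    }
}

Discarding a read whose source bit is unset is equivalent to folding in the identity for min, so for a contiguous-from-lane-0 mask this computes the same result as padding the missing lanes with the maximum value.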

I think the simplest way to fix this is to make sure that each warp is fully active, and populated with an appropriate value for reduction, entering into each warp shuffle operation. If you do that, you can get rid of some other aspects of your code that I consider slightly problematic. Here's a "fixed" example:

$ cat t1456.cu
#include <stdio.h>
#include <stdint.h>
__device__ void warpReduceMin(uint32_t &val, uint32_t mask)
{
    for (int offset = 16; offset > 0; offset /= 2)
    {
        uint32_t tmp;
        tmp = __shfl_down_sync(mask, val, offset);
        val = (tmp < val) ? tmp : val;
    }
}

__global__ void my_kernel(uint32_t *d_data, uint32_t N)
{
    __shared__ uint32_t shmem[32];

    uint32_t mask = 0xFFFFFFFFU;  // every lane participates in every shuffle
    // out-of-range threads carry the max uint32_t so they cannot win a min
    uint32_t val = (threadIdx.x < N)?(d_data[threadIdx.x]):0xFFFFFFFFU;
    uint32_t warp_id = threadIdx.x / warpSize;
    uint32_t lane_id = threadIdx.x % warpSize;

    warpReduceMin(val, mask);
    if (lane_id == 0)
        shmem[warp_id] = val;
    __syncthreads();
    val = shmem[lane_id];
    if (warp_id == 0)
        warpReduceMin( val, mask );
    if (threadIdx.x == 0)
      printf("val = %u\n", val);
}

int main(){
  const uint32_t N = 1000;
  uint32_t *d_data, *h_data = (uint32_t *)malloc(N*sizeof(uint32_t));
  cudaMalloc(&d_data, N*sizeof(uint32_t));
  for (int i = 0; i < N; i++)
    h_data[i] = i+1;
  cudaMemcpy(d_data, h_data, N*sizeof(uint32_t), cudaMemcpyHostToDevice);
  my_kernel<<<1,1024>>>(d_data, N);
  cudaDeviceSynchronize();
}

$ nvcc -o t1456 t1456.cu
$ cuda-memcheck ./t1456
========= CUDA-MEMCHECK
val = 1
========= ERROR SUMMARY: 0 errors
$
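A final hedged note: this kernel is launched with a full block of 1024 threads, so all 32 shmem slots are written before the second stage. If the block could have fewer than 32 warps, the upper slots of shmem would be read uninitialized at val = shmem[lane_id]. A sketch of the same sentinel-padding idea applied to that read (replacing the val = shmem[lane_id]; line above):

    // pad lanes beyond the number of populated warps with the max
    // sentinel so uninitialized shmem slots can't affect the min
    uint32_t nwarps = (blockDim.x + warpSize - 1) / warpSize;
    val = (lane_id < nwarps) ? shmem[lane_id] : 0xFFFFFFFFU;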
Robert Crovella