
I am following the reduction in CUDA presentation by Mark Harris. I've gotten to optimization step #5 and I am confused by the main logic of warpReduce() function:

__device__ void warpReduce(volatile int* sdata, int tid) {
  sdata[tid] += sdata[tid + 32]; // line A
  sdata[tid] += sdata[tid + 16]; // line B
  sdata[tid] += sdata[tid + 8];
  sdata[tid] += sdata[tid + 4];
  sdata[tid] += sdata[tid + 2];
  sdata[tid] += sdata[tid + 1];
}

My question is regarding line A: why do we need sdata[tid] += sdata[tid + 32]? If tid < 32, shouldn't it start from sdata[tid] += sdata[tid + 16]? Otherwise it will be out of range, won't it?

kingwales
  • Two comments: (a) That deck, while instructive, was done in about 2007 and none of that code is safe or correct on modern hardware and you shouldn't use anything you find in there today, (b) the code you show is using the last warp to sum all the partial sums, not just the tree contained within the data space of the last warp itself. Draw yourself a tree if it helps – talonmies Aug 30 '21 at 05:52
  • @talonmies: Why do you believe none of that code is correct? It might not be _optimal_, but it seems correct enough to me. As for safety - device-side code is rarely safe unless you're careful. e.g. no bounds checking and such. – einpoklum Aug 30 '21 at 09:20
  • That is implicit intra-warp synchronization programming, which specifically breaks on the last two generations of hardware. There is a reason why shuffle instructions are now the canonical solution for this. They are correct. This is not *on modern hardware* – talonmies Aug 30 '21 at 09:27
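To illustrate the comment above: on current GPUs, the canonical replacement for the volatile shared-memory trick is a warp shuffle, which exchanges registers directly between lanes with explicit synchronization via the lane mask. A minimal sketch (the function name warpReduceSum is illustrative, not from the slide deck):

```cuda
// Shuffle-based warp sum: no shared memory, no volatile, explicitly
// synchronized across the full warp by the 0xffffffff mask.
__device__ int warpReduceSum(int val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // after the loop, lane 0 holds the warp's total
}
```

Note this reduces the 32 values held by one warp; the deck's version additionally folds in a second block of 32 shared-memory elements first, which is exactly what the question's line A does.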

1 Answer


The explanation is that, in a call to warpReduce(), each thread handles two input elements per step, so the 32 threads of a warp reduce 32 * 2 = 64 elements.

Have a look at slide 14 in the slide deck you linked to - you'll see the number of threads is half the number of elements they're working on.

But I'll agree this is a bit surprising/confusing, given that in earlier slides the loop over the addition offset s uses the condition s < blockDim.x, so that only blockDim.x elements are processed per block.

einpoklum