
I am learning about unrolling loops to optimize kernel computation.

This is a code snippet from the book Professional CUDA C Programming:

if (idx + 4 * blockDim.x <= n)
{
    int a1 = g_idata[idx];
    int a2 = g_idata[idx + blockDim.x];
    int a3 = g_idata[idx + 2 * blockDim.x];
    int a4 = g_idata[idx + 3 * blockDim.x];
    tmpSum = a1 + a2 + a3 + a4;
}

As I understand it, each thread works across 4 data blocks and processes a single element from each one. So, compared with the kernel without unrolling, the launch configuration changes from grid.x blocks to reduceSmemUnroll<<<grid.x / 4, block>>>.
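To check this reading of the indexing, here is a minimal CPU-side sketch (plain C++, not device code) that replays the unrolled loads and counts how often each input element is read. The function name `unroll4_read_counts` and the parameters `blockDimX` and `gridDimX` are hypothetical, `idx` is assumed to be `blockIdx.x * blockDim.x * 4 + threadIdx.x`, and the bounds check is written as `idx + 3 * blockDimX < n`, which is an assumption of this sketch rather than the book's exact guard.

```cpp
#include <cassert>
#include <vector>

// CPU emulation of the 4x unrolled load. Block b, thread t reads
// idx, idx + blockDimX, idx + 2*blockDimX and idx + 3*blockDimX.
// Returns how many times each of the n input elements is read.
// Assumes n is a multiple of 4 * blockDimX.
std::vector<int> unroll4_read_counts(unsigned n, unsigned blockDimX) {
    assert(blockDimX > 0 && n % (4 * blockDimX) == 0);
    unsigned gridDimX = n / blockDimX / 4;   // i.e. grid.x / 4
    std::vector<int> reads(n, 0);
    for (unsigned b = 0; b < gridDimX; ++b) {
        for (unsigned t = 0; t < blockDimX; ++t) {
            unsigned idx = b * blockDimX * 4 + t;
            // The highest index read is idx + 3*blockDimX, so it must be < n.
            if (idx + 3 * blockDimX < n) {
                for (unsigned k = 0; k < 4; ++k) {
                    ++reads[idx + k * blockDimX];
                }
            }
        }
    }
    return reads;
}
```

With n = 1024 and 128 threads per block, only 2 blocks are launched instead of 8, and every element is still read exactly once.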

Then I have a question about the code snippet from Mark Harris's presentation on parallel reduction on page 32:

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;

while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
}
__syncthreads();

My question: how do I determine the grid size when launching this kernel? Should it be grid.x / 2 compared with the configuration without multiple loads per thread?

einpoklum
kingwales
    I’m voting to close this question because it is too trivial, and the answer basically appears in the linked presentation. – einpoklum Sep 03 '21 at 07:45

1 Answer


Yes, it should be half the number of blocks. Mark's presentation says so on the slide where the code you quoted from first appears, which is already slide 18:

Halve the number of blocks, and replace single load:

[code snippet]

with two loads and [the] first add of the reduction

Of course, you need to be careful about the sizes. The presentation assumes, for simplicity, that your overall length is a power of 2, so you can always safely divide by 2 while there are multiple elements left. In real life that is not the case, so you may need to allow for slack (e.g. "half the grid size plus one if it was odd").
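As a rough illustration of allowing for that slack, the block count can be computed with a rounding-up division. This is only a sketch: the helper name `blocks_for` and its parameters are hypothetical and do not come from the presentation.

```cpp
#include <cassert>

// Number of blocks needed when each thread loads `loadsPerThread` elements.
// Rounds up so that a partially filled final block is still launched.
unsigned blocks_for(unsigned n, unsigned blockSize, unsigned loadsPerThread) {
    assert(blockSize > 0 && loadsPerThread > 0);
    unsigned perBlock = blockSize * loadsPerThread;  // elements per block
    return (n + perBlock - 1) / perBlock;            // ceiling division
}
```

For n = 1025 and blockSize = 128, `blocks_for(n, 128, 1)` gives 9 blocks and `blocks_for(n, 128, 2)` gives 5, which is "half, plus one because 9 is odd". With a rounded-up grid, the second load `g_idata[i+blockSize]` can fall past the end of the input, so the kernel would also need a bounds check on that index.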

einpoklum
  • Thanks. Btw, do we really need the while loop? Let's say our data set size is `size`. If we configure `grid.x = ((size + block.x - 1) / block.x)`, then blockSize*gridDim.x is already the problem size, so it seems it will only loop once. – kingwales Sep 02 '21 at 22:23
  • You need the while loop if your input has more than 2*blockSize*gridDim.x elements, i.e. more than the grid covers in a single pass. – einpoklum Sep 03 '21 at 07:45
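To make the loop count concrete, here is a small CPU-side sketch (plain C++, not device code) that replays the index arithmetic of the quoted kernel for one thread and counts how many passes `while (i < n)` makes. The function name `loop_passes` and its parameters `blockIdxX` and `tid` are hypothetical stand-ins for `blockIdx.x` and `threadIdx.x`.

```cpp
#include <cassert>

// Counts the passes of `while (i < n)` for one thread, using the same
// starting index and stride as the quoted kernel.
unsigned loop_passes(unsigned n, unsigned blockSize, unsigned gridDimX,
                     unsigned blockIdxX, unsigned tid) {
    assert(blockSize > 0 && gridDimX > 0);
    assert(blockIdxX < gridDimX && tid < blockSize);
    unsigned i = blockIdxX * (blockSize * 2) + tid;
    unsigned gridSize = blockSize * 2 * gridDimX;  // elements covered per pass
    unsigned passes = 0;
    while (i < n) {
        ++passes;
        i += gridSize;
    }
    return passes;
}
```

With n = 1024 and blockSize = 128, a halved grid of 4 blocks gives exactly one pass per thread. A single block gives 4 passes, since each pass covers only 256 elements. With the unhalved grid of 8 blocks, threads in blocks 4 to 7 start at or beyond n and make no passes at all.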