
I found this parallel reduction code from Stanford which uses shared memory.

The example runs on 1<<18 (= 262144) elements and produces correct results.

Why do I get correct results for certain numbers of elements, while for other numbers, like 200000 or 25000, I get different, unexpected results?
It looks to me as if the code always launches the required number of thread blocks.

user1280671

2 Answers


This code causes the bug:

// launch a single block to compute the sum of the partial sums
block_sum<<<1,num_blocks,num_blocks * sizeof(float)>>>

Suppose num_blocks is 13; then in the kernel the first offset, blockDim.x / 2, is 6 (integer division), and

if(threadIdx.x < offset)
{
    // add a partial sum upstream to our own
    sdata[threadIdx.x] += sdata[threadIdx.x + offset]; 
}

will only combine the first 12 partial sums; sdata[12] never enters the reduction, causing the bug.
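
For reference, the surrounding loop in the kernel presumably looks like this (reconstructed from the fragment above and the usual shared-memory reduction pattern, so treat it as an approximation rather than the verbatim Stanford code):

// halve the number of active threads each iteration
for (unsigned int offset = blockDim.x / 2; offset > 0; offset >>= 1)
{
    if (threadIdx.x < offset)
    {
        // add a partial sum upstream to our own
        sdata[threadIdx.x] += sdata[threadIdx.x + offset];
    }
    // wait until all threads in the block have updated their partial sums
    __syncthreads();
}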

When the element count is 200000 or 250000, num_blocks is odd and the bug appears. More precisely, the repeated halving only covers every element when num_blocks is a power of two, as it is for 1<<18 elements with 512-thread blocks (num_blocks = 512); a merely even num_blocks can still lose elements once an intermediate offset becomes odd.
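
One way to make the second launch robust (a minimal sketch; the kernel name block_sum_padded and the host-side names are illustrative, not the original Stanford code) is to round the thread count up to the next power of two and zero-pad the shared array, so the halving always divides evenly:

__global__ void block_sum_padded(const float *partial_sums, float *total_sum,
                                 unsigned int num_elements)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;

    // blockDim.x is the next power of two >= num_elements; the extra
    // threads load the neutral element 0.0f instead of reading out of bounds
    sdata[tid] = (tid < num_elements) ? partial_sums[tid] : 0.0f;
    __syncthreads();

    // the halving now divides evenly at every step
    for (unsigned int offset = blockDim.x / 2; offset > 0; offset >>= 1)
    {
        if (tid < offset)
            sdata[tid] += sdata[tid + offset];
        __syncthreads();
    }

    if (tid == 0)
        *total_sum = sdata[0];
}

On the host, round the launch configuration up accordingly:

// smallest power of two >= num_blocks
unsigned int threads = 1;
while (threads < num_blocks) threads <<= 1;
block_sum_padded<<<1, threads, threads * sizeof(float)>>>(d_partial_sums, d_total, num_blocks);

Note that this still assumes num_blocks fits in a single thread block; inputs that produce more partial sums than the maximum block size need another reduction pass or the grid-stride formulation from the other answer.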

rps
  • Thank you for your reply. I tried testing it only with element counts that produce an even number of blocks, but I still wouldn't always get the correct result compared to the host result. My observation is that this example requires not only an even number of blocks but also block_size % num_blocks == 0. Any comments on this? – user1280671 Feb 02 '13 at 20:03
  • I'm doubtful about the second kernel launch; have you verified the output of the first kernel launch? – rps Feb 03 '13 at 04:02
  • How did you cater for the remaining part? – mibrahimy Nov 27 '19 at 14:51

This kernel is sensitive to its blocking parameters (grid and thread block size). Are you invoking it with enough threads to cover the input size?

It is more robust to formulate kernels like this with grid-stride for loops. Instead of:

unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

something like:

for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x;
      i < N;
      i += blockDim.x*gridDim.x ) {
    sum += in[i];
}
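
For concreteness, a complete reduction kernel built around that loop might look like this (a minimal sketch; the name reduce_sum and the power-of-two block-size assumption are mine, not taken from the handbook):

__global__ void reduce_sum(const float *in, float *out, size_t N)
{
    extern __shared__ float sdata[];

    // grid-stride loop: each thread accumulates a private partial sum,
    // so any grid/block configuration covers the whole input
    float sum = 0.0f;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < N;
         i += (size_t)blockDim.x * gridDim.x)
    {
        sum += in[i];
    }

    // tree reduction in shared memory (assumes blockDim.x is a power of two)
    sdata[threadIdx.x] = sum;
    __syncthreads();
    for (unsigned int offset = blockDim.x / 2; offset > 0; offset >>= 1)
    {
        if (threadIdx.x < offset)
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        __syncthreads();
    }

    // one partial sum per block; run the kernel again over 'out'
    // (or add the partials on the host) to obtain the final total
    if (threadIdx.x == 0)
        out[blockIdx.x] = sdata[0];
}

Launched as reduce_sum<<<num_blocks, block_size, block_size * sizeof(float)>>>(d_in, d_partial, N), this works for any num_blocks, which sidesteps the odd-block-count issue entirely.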

The source code in the CUDA Handbook has lots of examples of "blocking agnostic" code. The reduction code is here:

https://github.com/ArchaeaSoftware/cudahandbook/tree/master/reduction

ArchaeaSoftware
  • Thank you for participating. The program I referenced in my original post doesn't have a problem invoking the needed threads and thread blocks. The block size is 512, which is the maximum for my GPU (compute capability 1.3). It looks like there's a problem if the number of blocks isn't a power of 2. – user1280671 Feb 06 '13 at 14:17
  • Also, there's a problem if my input size results in a number of blocks greater than 512, which is used as the block size for the second kernel invocation. In the CUDA reduction archive you pointed me to, do you have in mind which one I could use for an input size of 307200, which gives 600 blocks of 512 threads each? – user1280671 Feb 06 '13 at 14:23