
Hello everyone, I'm trying to use the grid-stride method and atomic functions to do a multi-block reduction.
I know that the usual way to do this is to launch two kernels or use the last-block method as directed in this note (or this tutorial).

However, I thought this could also be done with a grid-stride loop combined with an atomic add.
In my tests it worked very well,
until, for some values, it gave the wrong answer (which is very weird).

I have tested several values of n and found that I get the wrong answer for n = 1234565, 1234566, 1234567.
Below is my whole code; it sums n ones, so the answer should be n.
Any help or comment is appreciated.

#include <algorithm>  // std::fill_n
#include <cstdio>     // printf
#include <iostream>

__global__ void stride_sum(const double* input,
                           const int size,
                           double* sumOut){
    extern __shared__ double sm[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockDim.x * blockIdx.x + tid;

    //doing grid loop using stride method.
    for(unsigned int s=i;
            s<size;
            s+=blockDim.x*gridDim.x){
        sm[tid] = input[i];
        __syncthreads();

        //doing parallel reduction.
        for(unsigned int ss = blockDim.x/2;ss>0;ss>>=1){
            if(tid<ss && tid+ss<size) sm[tid] += sm[tid+ss];
            __syncthreads();
        }

        //atomically add results to sumOut.
        if(tid==0) atomicAdd(sumOut, sm[0]);
    }
}

int main(){

    unsigned int n = 1234567;
    int blockSize = 4;
    int nBlocks = (n + blockSize - 1) / blockSize;
    int sharedMemory = sizeof(double)*blockSize;

    double *data, *sum;

    cudaMallocManaged(&data, sizeof(double)*n);
    cudaMallocManaged(&sum, sizeof(double));

    std::fill_n(data,n,1.);
    std::fill_n(sum,1,0.);

    stride_sum<<<nBlocks, blockSize, sharedMemory>>>(data,n,sum);

    cudaDeviceSynchronize();

    printf("res: 10.f \n",sum[0]);

    cudaFree(data);
    cudaFree(sum);

    return 0;
}

1 Answer


You have gotten quite a lot wrong in your implementation. This will work:

__global__ void stride_sum(const double* input,
                           const int size,
                           double* sumOut)
{
    extern __shared__ volatile double sm[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockDim.x * blockIdx.x + tid;

    //doing grid loop using stride method.
    double val = 0.;
    for(unsigned int s=i; s<size; s+=blockDim.x*gridDim.x){
        val += input[s];
    }

    // Load partial sum to memory
    sm[tid] = val; 
    __syncthreads();

    //doing parallel reduction.
    for(unsigned int ss = blockDim.x/2;ss>0;ss>>=1){
        if(tid<ss && tid+ss<size) sm[tid] += sm[tid+ss];
        __syncthreads();
    }

    //atomically add results to sumOut.
    if(tid==0) atomicAdd(sumOut, sm[0]);
}

[Never compiled and run, use at your own risk]

In short -- do the grid-strided summation, then a single shared memory reduction, then a single atomic update. Your implementation has undefined behaviour in a few places, most importantly the conditionally executed __syncthreads calls and the use of uninitialized shared memory when some threads fall out of the summation loop.
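
For reference, a minimal host-side sketch of how such a kernel might be launched (not from the original answer, untested; it reuses the data, n and sum allocations from the question's main, and the block/grid sizes 256 and 128 are arbitrary choices, with the block size kept a power of two for the tree reduction):

// Because the kernel strides over the whole array, the grid does not need
// one thread per element -- a fixed, modest grid is enough and keeps the
// number of atomicAdd calls small.
int blockSize = 256;                          // power of two for the reduction
int nBlocks   = 128;                          // each block loops over its strided slice
size_t smem   = sizeof(double) * blockSize;   // dynamic shared memory per block

stride_sum<<<nBlocks, blockSize, smem>>>(data, n, sum);
cudaDeviceSynchronize();

printf("res: %f \n", sum[0]);                 // should print n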

  • Thanks for the help, your code works well! By the way, can you be a little more specific about the problems in my code? 1. What do you mean by conditionally executed __syncthreads? 2. I thought tid+ss – chanjure Apr 01 '20 at 02:14
  • So I figured out 1.: by doing the sync inside the stride loop, some threads outside 'size' may cause the problem, and 4.: volatile is there to ensure correctness, and it does not hurt performance very much. But for 2 and 3 I still need some help. – chanjure Apr 01 '20 at 02:36
  • The float/double thing for `val` was a typo -- like I said in my answer, I wrote that code in the browser, there could be mistakes, and in this case there was. As for 3: if threads in the upper half of a warp exit the outer loop, then they don't update the contents of shared memory and leave incorrect data in shared memory. Then the threads in the lower half of the warp do the reduction with that incorrect data, and things break. – talonmies Apr 01 '20 at 07:46
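
For what it's worth, here is another untested sketch (not from the answer; the kernel name is just for illustration) showing how the asker's original per-iteration structure could be kept while avoiding both problems: the outer loop bound is the same for every thread in a block, so all threads run the same number of iterations and reach every __syncthreads together, and out-of-range elements are padded with zero so shared memory is never left stale. It assumes blockDim.x is a power of two.

__global__ void stride_sum_uniform(const double* input,
                                   const int size,
                                   double* sumOut)
{
    extern __shared__ double sm[];
    unsigned int tid = threadIdx.x;

    // base is identical for every thread in the block, so all threads
    // execute the same number of outer iterations and hit the barriers together.
    for(unsigned int base = blockDim.x * blockIdx.x;
            base < size;
            base += blockDim.x * gridDim.x){
        unsigned int idx = base + tid;
        sm[tid] = (idx < size) ? input[idx] : 0.;  // pad the tail with zeros
        __syncthreads();

        // tree reduction over the block's loaded values
        for(unsigned int ss = blockDim.x/2; ss>0; ss>>=1){
            if(tid < ss) sm[tid] += sm[tid + ss];
            __syncthreads();
        }

        // one atomic per block per outer iteration
        if(tid == 0) atomicAdd(sumOut, sm[0]);
    }
}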