I am trying to implement my own reduce-sum for big 1D arrays. I came up with a reduce kernel and a way of calling it several times to reduce step by step until a single value is reached. I know this is not the optimal way of computing this (as you can see in my code, it may end up launching a kernel just to add 3 values), but let's set that aside for a moment and try to work with this approach.
Basically, in short: each call to the reduce kernel shrinks the array by a factor of MAXTHREADS, in this case 1024 (for example, 5000 elements become 5 partial sums, and a final call reduces those 5 values to one). When the size is smaller than 1024 it seems to work properly, but for bigger sizes it miserably fails to compute the right sum. This happens with all of the larger array sizes I have tried. What am I missing?
I will also gladly accept comments about the quality of the code, but I am mainly interested in fixing it.
Below I have posted the whole kernel and a snippet of the kernel call. If I have missed some relevant part of the kernel call, please comment and I will fix it.
The original code has error checks; all kernels always run and never return a CUDA error.
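(In case it matters, the error checks are the usual wrapper pattern; below is only a minimal sketch of what I mean, with a hypothetical cudaCheck macro rather than my exact code.)

// Sketch of the error checking used (hypothetical cudaCheck macro).
// Needs <stdio.h> and <stdlib.h> in addition to cuda_runtime.h.
#define cudaCheck(call)                                              \
    do {                                                             \
        cudaError_t e = (call);                                      \
        if (e != cudaSuccess) {                                      \
            printf("CUDA error %s at %s:%d\n",                       \
                   cudaGetErrorString(e), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// After each kernel launch:
// cudaCheck(cudaGetLastError());
// cudaCheck(cudaDeviceSynchronize());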
Reduce kernel
__global__ void reduce6(float *g_idata, float *g_odata, unsigned int n){
    extern __shared__ float sdata[];

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(MAXTREADS*2) + tid;
    unsigned int gridSize = MAXTREADS*2*gridDim.x;
    //blockSize==MAXTHREADS

    sdata[tid] = 0;
    float mySum = 0;

    if (tid>=n) return;

    if ( (gridDim.x>0) & ((i+MAXTREADS)<n)){
        while (i < n) {
            sdata[tid] += g_idata[i] + g_idata[i+MAXTREADS];
            i += gridSize;
        }
    }else{
        sdata[tid] += g_idata[i];
    }
    __syncthreads();

    // do reduction in shared mem
    if (tid < 512)
        sdata[tid] += sdata[tid + 512];
    __syncthreads();
    if (tid < 256)
        sdata[tid] += sdata[tid + 256];
    __syncthreads();
    if (tid < 128)
        sdata[tid] += sdata[tid + 128];
    __syncthreads();
    if (tid < 64)
        sdata[tid] += sdata[tid + 64];
    __syncthreads();

#if (__CUDA_ARCH__ >= 300 )
    if ( tid < 32 )
    {
        // Fetch final intermediate sum from 2nd warp
        mySum = sdata[tid] + sdata[tid + 32];
        // Reduce final warp using shuffle
        for (int offset = warpSize/2; offset > 0; offset /= 2)
            mySum += __shfl_down(mySum, offset);
    }
    sdata[0] = mySum;
#else
    // fully unroll reduction within a single warp
    if (tid < 32) {
        warpReduce(sdata, tid);
    }
#endif

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

__device__ void warpReduce(volatile float *sdata, unsigned int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
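(Side note: I am using the legacy __shfl_down intrinsic here. If you are on CUDA 9 or later, that intrinsic is deprecated and the loop would use the _sync variant instead; a minimal sketch of the equivalent, assuming all 32 lanes of the warp are active:)

// CUDA 9+ equivalent of the shuffle loop above (sketch; full-warp mask assumed)
for (int offset = warpSize/2; offset > 0; offset /= 2)
    mySum += __shfl_down_sync(0xffffffff, mySum, offset);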
Call the kernel for an arbitrary size of total_pixels:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#defineMAXTREADS 1024
__global__ void initKernel(float * devPtr, const float val, const size_t nwords)
{
//https://stackoverflow.com/questions/10589925/initialize-device-array-in-cuda
int tidx = threadIdx.x + blockDim.x * blockIdx.x;
int stride = blockDim.x * gridDim.x;
for (; tidx < nwords; tidx += stride)
devPtr[tidx] = val;
}
int main()
{
size_t total_pixels = 5000;
unsigned long int n = (unsigned long int)total_pixels;
float* d_image, *d_aux;
float sum;
cudaMalloc(&d_image, total_pixels*sizeof(float));
cudaMalloc(&d_aux, sizeof(float)*(n + MAXTREADS - 1) / MAXTREADS);
for (int i = 0; i < 10; i++){
sum = 0;
cudaMemset(&d_image, 1, total_pixels*sizeof(float));
int dimblockRed = MAXTREADS;
int dimgridRed = (total_pixels + MAXTREADS - 1) / MAXTREADS;
int reduceCont = 0;
initKernel << < dimgridRed, dimblockRed >> >(d_image, 1.0, total_pixels);
while (dimgridRed > 1) {
if (reduceCont % 2 == 0){
reduce6 << <dimgridRed, dimblockRed, MAXTREADS*sizeof(float) >> >(d_image, d_aux, n);
}
else{
reduce6 << <dimgridRed, dimblockRed, MAXTREADS*sizeof(float) >> >(d_aux, d_image, n);
}
n = dimgridRed;
dimgridRed = (n + MAXTREADS - 1) / MAXTREADS;
reduceCont++;
}
if (reduceCont % 2 == 0){
reduce6 << <1, dimblockRed, MAXTREADS*sizeof(float) >> >(d_image, d_aux, n);
cudaMemcpy(&sum, d_aux, sizeof(float), cudaMemcpyDeviceToHost);
}
else{
reduce6 << <1, dimblockRed, MAXTREADS*sizeof(float) >> >(d_aux, d_image, n);
cudaMemcpy(&sum, d_image, sizeof(float), cudaMemcpyDeviceToHost);
}
printf("%f ", sum);
}
cudaDeviceReset();
return 0;
}
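(For reference, this is roughly how I check the result; just a sketch, relying on the fact that initKernel fills the array with 1.0f, so the expected sum is simply total_pixels.)

// Host-side sanity check (sketch): the array is all ones, so the sum should be total_pixels.
// Needs <math.h> for fabsf.
float expected = (float)total_pixels;
if (fabsf(sum - expected) > 1e-3f * expected)
    printf("wrong sum: got %f, expected %f\n", sum, expected);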