
I'm trying to implement the three-phase parallel scan described in chapter 8 of Programming Massively Parallel Processors, 3rd edition (the book gives no code, only instructions). The algorithm uses a single block with the maximum number of threads per block, and it is limited by the size of shared memory, because all of the elements must fit into shared memory.

After some debugging, I encountered a problem with the additions in Phase 3 (see the code below) when I use a large number of elements, for instance 8192, and more than 1 thread.

The graphical concept of the algorithm is the following: [figure: diagram of the three-phase scan]

And below you can see the kernel code:

__global__ 
void efficient_Kogge_Stone_scan_kernel(float *X, float *Y, int InputSize) {
    __shared__ float XY[SECTION_SIZE];
    __shared__ float AUS[BLOCK_DIM];
    //int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Keep in mind: the input is partitioned into blockDim.x subsections, i.e. for 8 threads --> 8 subsections

    // collaborative load in a coalesced manner
    for (int j = 0; j < SECTION_SIZE; j += blockDim.x) {
        XY[threadIdx.x + j] = X[threadIdx.x + j];
    }
    __syncthreads();


    // PHASE 1: each thread scans its own subsection sequentially
    // At the end of this phase the last element of each subsection contains the sum of all elements in that subsection
    for (int j = 1; j < SUBSECTION_SIZE; j++) {
        XY[threadIdx.x * (SUBSECTION_SIZE)+j] += XY[threadIdx.x * (SUBSECTION_SIZE)+j - 1];
    }
    __syncthreads();


    // PHASE 2: perform an iterative Kogge-Stone scan of the last elements of each subsection of XY, first loaded into AUS
    AUS[threadIdx.x] = XY[threadIdx.x * (SUBSECTION_SIZE)+(SUBSECTION_SIZE)-1];
    float in;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (threadIdx.x >= stride) {
            in = AUS[threadIdx.x - stride];
        }
        __syncthreads();
        if (threadIdx.x >= stride) {
            AUS[threadIdx.x] += in;
        }
    }
    __syncthreads();


    // PHASE 3: each thread adds to its elements the new value of the last element of its predecessor's section
    if (threadIdx.x > 0) {
        for (unsigned int stride = 0; stride < (SUBSECTION_SIZE); stride++) {
            XY[threadIdx.x * (SUBSECTION_SIZE)+stride] += AUS[threadIdx.x - 1];  // <-- 
        }
    }
    __syncthreads();


    // store the result into output vector
    for (int j = 0; j < SECTION_SIZE; j += blockDim.x) {
        Y[threadIdx.x + j] = XY[threadIdx.x + j];
    }
}
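For reference, the kernel relies on the compile-time constants SECTION_SIZE, SUBSECTION_SIZE and BLOCK_DIM, which are not shown above. Below is a minimal sketch of how they might be defined and how the kernel might be launched for the 8192-element case; the names d_X and d_Y and the concrete values are my assumptions, not the original host code.

#define BLOCK_DIM 1024                              // threads per block (assumed)
#define SECTION_SIZE 8192                           // total elements handled by the single block (assumed)
#define SUBSECTION_SIZE (SECTION_SIZE / BLOCK_DIM)  // elements scanned serially by each thread

// ... allocate and fill d_X, allocate d_Y (omitted) ...

// One block only: the whole input must fit in shared memory
// (SECTION_SIZE * sizeof(float) bytes for XY plus BLOCK_DIM * sizeof(float) for AUS).
efficient_Kogge_Stone_scan_kernel<<<1, BLOCK_DIM>>>(d_X, d_Y, SECTION_SIZE);
cudaDeviceSynchronize();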

If I use one thread in the block and 8192 elements, it works perfectly, but if I use more than one thread, the result is wrong at XY[5793] (or X[5793] after the results are stored). With 4096 elements it works with anywhere from one up to 1024 threads. If I use int instead of float numbers, it works even with 8192 elements and one or more threads.

I also tried to verify the result in MATLAB, and this is the output comparison:

  • X[5793] = 16788115 ---- MATLAB
  • X[5793] = 16788114 ---- CPU
  • X[5793] = 16788116 ---- GPU

As we can see, the CPU result also differs from MATLAB's, so given these results I think the problem is floating-point addition, but note that I filled the input array with ordered "x.00" float values (for instance {1.00, 2.00, 3.00, 4.00, ..., 8192.00}).
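To see why even the sequential CPU sum drifts: once a running sum exceeds 2^24 = 16777216, a float can no longer represent every integer, so adding the next "x.00" value may round. Here is a small host-side sketch (my own illustration, not part of the original code) that compares a single-precision prefix sum against a double-precision reference:

#include <cstdio>

int main() {
    const int N = 8192;
    float  runF = 0.0f;   // running sum in single precision
    double runD = 0.0;    // reference in double precision (exact for these values)
    for (int i = 1; i <= N; ++i) {
        runF += (float)i;
        runD += (double)i;
        if ((double)runF != runD) {
            // On IEEE 754 hardware this is expected to trigger once the
            // running sum passes 2^24 (around i = 5793 for this input).
            printf("first mismatch at i=%d: float=%.1f exact=%.1f\n", i, runF, runD);
            return 0;
        }
    }
    printf("no mismatch\n");
    return 0;
}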

Another problem is performance: the host code is always faster than the kernel code with these configuration parameters and these inputs. Is that normal?

If you need the entire source code, you can find it here.

sgiraz
  • Don't spam tags! CUDA is not C! – too honest for this site Apr 27 '17 at 14:28
  • Floating point isn't associative. There is no guarantee that if you sum a list of numbers in floating point in different orders the results will be the same. – talonmies Apr 28 '17 at 06:26
  • Consider first using an implementation with integers, so you don't have to account for floating-point precision issues (it looks like your overall sum fits nicely in a 32-bit integer). Also, you seem to be using a lot of `__syncthreads()` in there. Finally, why just 1 block? You can never get good performance with just one block of anything. – einpoklum Apr 29 '17 at 19:10
  • Yes, you're right @einpoklum. In fact, I tried with integers and there are no problems with the result. This algorithm is only a primitive for understanding how the hierarchical algorithm works, which performs the prefix sum with multiple blocks. I have now implemented the hierarchical algorithm using integers, and I saw that starting from 65000 elements the kernel becomes faster than the host. The `__syncthreads()` calls are necessary because of the shared array, where multiple threads can read and write at the same time in different parts of the code. – sgiraz Apr 29 '17 at 19:38

2 Answers


8192 is 2^13

sum(1..8192) is near 8192^2/2; exactly, it is 8192*8193/2 = 33558528, which is a little more than 2^25.

You thus need 26 bits to represent it (see note below).

A single-precision IEEE 754 float has only a 24-bit significand, so depending on how the sum is performed (in which order), and possibly on the rounding direction (generally the default round to nearest, ties to even), the result might differ.

Note that, strictly speaking, the exact sum can be represented in float without rounding, because its last 12 bits are zero, so the significand spans only 14 bits. But that is not true of the partial sums.
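To make the order dependence concrete, here is a small host-side sketch (my own illustration, not part of this answer) that scans the same 1.0 ... 8192.0 input in single precision in two different orders: straight left-to-right, and in the kernel's three-phase order (per-subsection scans, a prefix sum of the subsection totals, then adding the preceding running total to each element). Phase 2 is done sequentially here rather than with the kernel's Kogge-Stone order, but the point about rounding is the same; the sizes are assumed to match the question's setup.

#include <cstdio>

int main() {
    const int N = 8192, T = 1024, SUB = N / T;   // sizes assumed from the question
    static float seq[N], blk[N], totals[T], aus[T];

    // (a) Straight left-to-right float scan.
    float run = 0.0f;
    for (int i = 0; i < N; ++i) { run += (float)(i + 1); seq[i] = run; }

    // (b) Three-phase order: scan each subsection, prefix-sum the subsection
    //     totals, then add the preceding running total to every element.
    for (int t = 0; t < T; ++t) {
        float s = 0.0f;
        for (int j = 0; j < SUB; ++j) { s += (float)(t * SUB + j + 1); blk[t * SUB + j] = s; }
        totals[t] = s;
    }
    aus[0] = totals[0];
    for (int t = 1; t < T; ++t) aus[t] = aus[t - 1] + totals[t];
    for (int t = 1; t < T; ++t)
        for (int j = 0; j < SUB; ++j) blk[t * SUB + j] += aus[t - 1];

    // The two single-precision results are expected to start disagreeing once
    // the running sums pass 2^24 (around index 5793 for this input).
    for (int i = 0; i < N; ++i) {
        if (seq[i] != blk[i]) {
            printf("first order-dependent difference at index %d: sequential = %.1f, three-phase = %.1f\n",
                   i, seq[i], blk[i]);
            return 0;
        }
    }
    printf("no difference\n");
    return 0;
}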

aka.nice

There might be an issue with the first scan:

XY[threadIdx.x * (SUBSECTION_SIZE)+j] += XY[threadIdx.x * (SUBSECTION_SIZE)+j - 1];

This can result in incoherent reads of the elements in shared memory: when you read the previous element, it is not guaranteed that another thread has not already updated that value.

Try to break this section into parts by storing the value in a register. Example:

float t = XY[threadIdx.x * (SUBSECTION_SIZE)+j - 1];
__syncthreads();
XY[threadIdx.x * (SUBSECTION_SIZE)+j] += t;
Kochoba
  • Each thread works in its own subsection and doesn't touch the subsections of other threads; I tried it, and the problem persists. – sgiraz Apr 27 '17 at 16:18