OpenCL float sum reduction

Question

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):

__local float sum = 0;
int i;
for(i = 0; i < length; i++)
  sum += //some operation depending on i here;

Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end having 1 thread to make the total sum.

In pseudo code, I would like to able to write something like this:

int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
  res = sum;

Is there a way?

I have a race condition on sum.

It's called parallel reduction, look it up. It's not as easy as your snippet, but it's not excruciatingly hard either. Just takes some more work. — Thomas, Dec 16 '13 at 14:19

Bruce Dean · Answer 1 · 2013-12-17T02:17:00.653

To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.

Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.

If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).

EDIT: Added extra kernel argument for the output. Added dot product to sum over the float 4 vectors.

__kernel void reduction_vector(__global float4* data,__local float4* partial_sums, __global float* output) 
{
    int lid = get_local_id(0);
    int group_size = get_local_size(0);
    partial_sums[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    for(int i = group_size/2; i>0; i >>= 1) {
        if(lid < i) {
            partial_sums[lid] += partial_sums[lid + i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if(lid == 0) {
        output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
    }
}

I'm pretty sure "if(lid < i)" should be "if(lid <= i)", also your algorithm falls apart if at any point the amount of numbers to be summed is odd, you can fix this by adding a check at the beginning of each iteration, and if the amount of numbers is odd, have the worker id 0 sum the last number in addition to it's designated number — Jordan LaPrise, Jul 04 '18 at 21:01

score 4 · Answer 2 · answered Sep 21 '18 at 00:01

I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.

The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.

I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.

This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.

The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU, or else call from another kernel and do this last step on the GPU.

This part is a little over my head, but I believe, this code also avoids bank switching issues by reading from local memory essentially sequentially. ** Would love confirmation on that from anyone that knows.

Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...

__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
        const size_t globalId = get_global_id(0);
        const size_t localId = get_local_id(0);
        target[localId] = A[globalId+AOffset];

        barrier(CLK_LOCAL_MEM_FENCE);
        size_t blockSize = get_local_size(0);
        size_t halfBlockSize = blockSize / 2;
        while (halfBlockSize>0) {
            if (localId<halfBlockSize) {
                target[localId] += target[localId + halfBlockSize];
                if ((halfBlockSize*2)<blockSize) { // uneven block division
                    if (localId==0) { // when localID==0
                        target[localId] += target[localId + (blockSize-1)];
                    }
                }
            }
            barrier(CLK_LOCAL_MEM_FENCE);
            blockSize = halfBlockSize;
            halfBlockSize = blockSize / 2;
        }
        if (localId==0) {
            output[get_group_id(0)] = target[0];
        }
    }

https://pastebin.com/xN4yQ28N

score 2 · Answer 3 · answered Feb 22 '18 at 09:33

2

You can use new work_group_reduce_add() function for sum reduction inside single work group if you have support for OpenCL C 2.0 features

answered Feb 22 '18 at 09:33

Timur Magomedov

64
4

score 1 · Answer 4 · answered Jan 02 '18 at 09:52

A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.

For example, please use the following ridiculously simple CL code:

__kernel void foldKernel(__global float *arVal, int offset) {
    int gid = get_global_id(0);
    arVal[gid] = arVal[gid]+arVal[gid+offset];
}

With the following Java/JOCL host code (or port it to C++ etc):

    int t = totalDataSize;
    while (t > 1) {
        int m = t / 2;
        int n = (t + 1) / 2;
        clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
        clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
        cl_event evFold = new cl_event();
        clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
        clWaitForEvents(1, new cl_event[]{evFold});
        t = n;
    }

The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.

Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"

This isn't a very fast way to do it as you are constantly reading/writing from/to global memory. You're much better of doing coalesced reads into local memory, performing whatever operations, and then writing back to global only after you've exhausted all the computation possible within the local memory pool. — colonel of truth, May 11 '18 at 08:17
Using local memory might be faster. Then again, it might not if the OpenCL platform optimizes the kernel accesses to consecutive global memory locations. I guess no up-votes means that other people don't value simplicity as much as I do! — Adam Gawne-Cain, May 12 '18 at 08:55
I can show you benchmarks if you're interested, but it's much faster; local latency is ~100 times lower (faster) than global. You also have less kernel instantiations, which also affects speed, and it's often worth it to suffer additional bank conflicts just to reduce them (largely because global is so slow). While I agree your method is very simple, usually speed takes priority when dealing with OpenCL. — colonel of truth, May 23 '18 at 12:52

OpenCL float sum reduction

4 Answers4

Linked