
I am trying to implement an OpenCL version for doing a reduction of an array of floats.

To achieve this, I took the following code snippet found on the web:

__kernel void sumGPU ( __global const double *input, 
                       __global double *partialSums,
                       __local double *localSums)
 {
  uint local_id = get_local_id(0);
  uint group_size = get_local_size(0);

  // Copy from global memory to local memory
  localSums[local_id] = input[get_global_id(0)];

  // Loop for computing localSums
  for (uint stride = group_size/2; stride>0; stride /=2)
     {
      // Waiting for each 2x2 addition into given workgroup
      barrier(CLK_LOCAL_MEM_FENCE);

      // Divide WorkGroup into 2 parts and add elements 2 by 2
      // between local_id and local_id + stride
      if (local_id < stride)
        localSums[local_id] += localSums[local_id + stride];
     }

  // Write result into partialSums[nWorkGroups]
  if (local_id == 0)
    partialSums[get_group_id(0)] = localSums[0];
 }                  

This kernel code works well, but I would like to compute the final sum by adding all the partial sums from each work group. Currently, I do this final-sum step on the CPU with a simple loop over the nWorkGroups partial sums.
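
For reference, the CPU side of that final step looks roughly like this (a minimal sketch; names such as queue, partialSumsBuf and nWorkGroups are placeholders, not from my real code):

// Read back the per-work-group partial sums and finish the sum on the CPU.
double *partial = (double *)malloc(nWorkGroups * sizeof(double));
clEnqueueReadBuffer(queue, partialSumsBuf, CL_TRUE, 0,
                    nWorkGroups * sizeof(double), partial, 0, NULL, NULL);

double total = 0.0;
for (size_t i = 0; i < nWorkGroups; ++i)   // simple serial loop
    total += partial[i];
free(partial);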

I also saw another solution using atomic functions, but they seem to be implemented for int, not for float. I think only CUDA provides atomic functions for float.
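
A workaround I have seen mentioned for OpenCL 1.2 emulates a float atomic add with a compare-and-swap loop on the value's bit pattern. This is only a sketch of that technique (the helper name atomic_add_float is hypothetical):

inline void atomic_add_float(volatile __global float *addr, float val)
{
    // Retry until our compare-and-swap wins: reinterpret the float's bits
    // as a uint so atomic_cmpxchg (core in OpenCL 1.2) can be used.
    union { uint u; float f; } old_val, new_val;
    do {
        old_val.f = *addr;             // snapshot the current value
        new_val.f = old_val.f + val;   // proposed new sum
    } while (atomic_cmpxchg((volatile __global uint *)addr,
                            old_val.u, new_val.u) != old_val.u);
}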

I also saw that I could write another kernel that performs this sum operation, but I would like to avoid that solution in order to keep a simple, readable source. Maybe I cannot do without it...

I must tell you that I use OpenCL 1.2 (as reported by clinfo) on a Radeon HD 7970 Tahiti 3GB (I think OpenCL 2.0 is not supported by my card).

More generally, I would like advice about the simplest method to perform this final summation with my graphics card model and OpenCL 1.2.

  • Normally there is no need to do the final step on the GPU. For example, if you have 8M = 2^23 elements and each work group is 128 = 2^7, running your kernel twice leaves you with 512 values, a quite small amount. Another iteration and that goes down to just 4 values, which the CPU can take care of easily. The best approach is to use the GPU for what it is good at; small sizes and linear sums are not well suited to it, so you are better off letting the CPU do it (probably even in the 512 case). – DarkZeros Apr 28 '16 at 02:49
  • @DarkZeros thanks for your remark. The problem is that I would like to do this reduction inside a main loop where, at each iteration, I have to compute the sum reduction. For that, I need to swap the input and output buffers (by swapping arguments with setKernelArg) at each iteration; that's why I wouldn't like to use the CPU for the final sum of the partial sums, because I think this method breaks the execution pipeline, i.e. the benefits of GPU processing. –  Apr 30 '16 at 15:07
  • You have to bear in mind that a GPU does not have an infinite number of work items. Chances are you can have a few thousand work items in parallel, so for 1M elements the reduction takes a few "iterations" internally in the GPU. Relaunching another iteration with different parameters should be cheap, with minimal overhead, as long as you are not doing a clFinish() that blocks the pipeline (see the host-side sketch after these comments). You can read section 3, two-stage reduction: http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/ – DarkZeros May 02 '16 at 01:14
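
To make the relaunching idea from the comments concrete, here is a minimal host-side sketch of the multi-pass reduction (kernel and buffer names such as sumKernel, bufA, bufB and localSize are illustrative; it assumes the element count stays a multiple of localSize):

// Each pass shrinks the data by a factor of localSize, ping-ponging
// between two device buffers, until the CPU can finish cheaply.
size_t n = N;                        // current number of elements
cl_mem in = bufA, out = bufB;        // hypothetical device buffers
while (n > localSize) {
    size_t groups = n / localSize;
    clSetKernelArg(sumKernel, 0, sizeof(cl_mem), &in);
    clSetKernelArg(sumKernel, 1, sizeof(cl_mem), &out);
    clSetKernelArg(sumKernel, 2, localSize * sizeof(double), NULL);
    clEnqueueNDRangeKernel(queue, sumKernel, 1, NULL,
                           &n, &localSize, 0, NULL, NULL);
    cl_mem tmp = in; in = out; out = tmp;   // swap for the next pass
    n = groups;
}
// Read back the remaining n values and sum them on the CPU.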

2 Answers


Sorry for the previous code; it also has a problem.

CLK_GLOBAL_MEM_FENCE affects only the current work group. I was confused. =[

If you want to do the reduction sum on the GPU, you should enqueue a reduction kernel with clEnqueueNDRangeKernel after clFinish(commandQueue).

Please just take the concept.

__kernel void sumGPU ( __global const double *input,
                       __global double *partialSums,
                       __local double *localSums)
 {
  uint local_id = get_local_id(0);
  uint group_size = get_local_size(0);

  // Copy from global memory to local memory
  localSums[local_id] = input[get_global_id(0)];

  // Loop for computing localSums
  for (uint stride = group_size/2; stride>0; stride /=2)
     {
      // Waiting for each 2x2 addition into given workgroup
      barrier(CLK_LOCAL_MEM_FENCE);

      // Divide WorkGroup into 2 parts and add elements 2 by 2
      // between local_id and local_id + stride
      if (local_id < stride)
        localSums[local_id] += localSums[local_id + stride];
     }

  // Write result into partialSums[nWorkGroups]
  if (local_id == 0)
    partialSums[get_group_id(0)] = localSums[0];

  barrier(CLK_GLOBAL_MEM_FENCE);

  // First work group gathers the partial sums of all groups and reduces
  // them again (caution: barrier() cannot synchronize across work groups,
  // so this relies on the other groups having already finished; see the
  // comments below)
  if (get_group_id(0) == 0) {
      if (local_id < get_num_groups(0)) {  // 16384
          for (int n = 0; n < get_num_groups(0); n += group_size)
              localSums[local_id] += partialSums[local_id + n];
          barrier(CLK_LOCAL_MEM_FENCE);

          for (int s = group_size/2; s > 0; s /= 2) {
              if (local_id < s)
                  localSums[local_id] += localSums[local_id + s];
              barrier(CLK_LOCAL_MEM_FENCE);
          }
          if (local_id == 0)
              partialSums[0] = localSums[0];
      }
  }
 }
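
Given the synchronization problem discussed in the comments below, the safer OpenCL 1.2 route is the separate second kernel mentioned above. A sketch only (it assumes the nWorkGroups partial sums fit within one work group, launched as a single group):

__kernel void finalSum(__global const double *partialSums,
                       __global double *result,
                       __local double *scratch,
                       uint count)
{
    uint lid = get_local_id(0);

    // Load one partial sum per work item, padding with zeros.
    scratch[lid] = (lid < count) ? partialSums[lid] : 0.0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Standard local-memory tree reduction.
    for (uint s = get_local_size(0) / 2; s > 0; s /= 2) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        result[0] = scratch[0];
}

Enqueued on the same in-order queue after the first kernel, it runs only once all the partial sums have been written.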

Andrew
  • Thanks. In your code, is there a little error? Do you mean "if (get_global_id(0) < s)" instead of "if (get_global_id(0) < i)"? –  Apr 29 '16 at 12:21
  • This code would not work. It assumes global work-item synchronization, which does not exist in OpenCL. When work item 0,0 gets to the partial-sum part, the work group that should compute (0+s) has probably not even started yet. – DarkZeros May 02 '16 at 01:09
  • @DarkZeros, you're right. It has a global memory synchronization problem. I'll modify it. – Andrew May 02 '16 at 21:34

If the floats' order of magnitude is smaller than exa scale (i.e., the running total fits in a 64-bit integer), then:

Instead of

if (local_id == 0)
  partialSums[get_group_id(0)] = localSums[0];

You could use

if (local_id == 0)
{
    if (strategy == ATOMIC)
    {
        long integer_part = getIntegerPart(localSums[0]);
        atom_add(&totalSumIntegerPart[0], integer_part);
        long float_part = 1000000 * getFloatPart(localSums[0]);
        // 1000000 for saving meaningful 7 digits as integer
        atom_add(&totalSumFloatPart[0], float_part);
    }
}

The fractional-part accumulator will overflow past 1,000,000, so when you divide it by 1,000,000 in another kernel it can exceed 1; you then take its integer part (the carry) and add it to the real integer part:

float value = 0;
if (strategy == ATOMIC)
{
    float float_part = getFloatPart_(totalSumFloatPart[0]);
    float integer_part = getIntegerPart_(totalSumFloatPart[0])
                       + totalSumIntegerPart[0];
    value = integer_part + float_part;
}

Just a few atomic operations shouldn't have a noticeable effect on the total kernel time.

Some of these get___part helpers can be written easily using floor() and similar functions; some need a division by 1M.
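
For concreteness, the helpers could be sketched like this (my code above leaves them unwritten; note that atom_add on long also requires the cl_khr_int64_base_atomics extension):

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

// Encode side: split a double into its integer and fractional parts.
long getIntegerPart(double x)  { return (long)x; }                 // truncate toward zero
double getFloatPart(double x)  { return x - (double)((long)x); }   // fractional remainder

// Decode side: the accumulator holds fraction * 1000000, so anything
// past 1000000 is a carry into the integer part.
long getIntegerPart_(long scaledSum) { return scaledSum / 1000000; }
float getFloatPart_(long scaledSum)  { return (scaledSum % 1000000) / 1000000.0f; }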

huseyin tugrul buyukisik