
This is a follow-up to my first issue, explained at this link.

As a reminder, I would like to apply a method that can perform multiple sum reductions with OpenCL (my GPU device only supports OpenCL 1.2). I need to compute the sum reduction of an array to check the convergence criterion at each iteration of the main loop.

Currently, I have a version that performs only one sum reduction (i.e. one iteration). In this version, for simplicity, I used a sequential CPU loop to add up the partial sums and get the final value of the sum.

Following the advice on my previous question, my issue is that I don't know how to perform the final sum by calling the NDRangeKernel function a second time (i.e. executing the kernel code a second time).

Indeed, with a second call I will always face the same problem of getting the sum of the partial sums (themselves computed by the first call to NDRangeKernel): it seems to be a recursive issue.

Let's take an example from the figure above: if the input array size is 10240000 and the WorkGroup size is 16, we get 10000*2^10/2^4 = 10000*2^6 = 640000 WorkGroups.

So after the first call I get 640000 partial sums: how do I deal with the final summation of all these partial sums? If I call the kernel code another time with, for example, WorkGroup size = 16 and global size = 640000, I will get nWorkGroups = 640000/16 = 40000 partial sums, so I would have to call the kernel code one more time and repeat this process until nWorkGroups < WorkGroup size (roughly, the sketch just below is what I have in mind).
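The repeated process I imagine would look something like the following host-side sketch (this is not working code from my project: reduceKernel is assumed to be a sum kernel that writes one partial sum per work group, with the same argument order as the kernel snippet quoted below; the buffers are assumed to be large enough and there is no error checking):

#include <CL/cl.h>

/* Sketch only: repeatedly launch the reduction kernel, swapping the
 * input and output buffers, until a single value remains. */
static float reduceOnDevice(cl_command_queue queue, cl_kernel reduceKernel,
                            cl_mem src, cl_mem dst,
                            size_t n, size_t workGroupSize)
{
    while (n > 1) {
        size_t numGroups  = (n + workGroupSize - 1) / workGroupSize;
        size_t globalSize = numGroups * workGroupSize;
        cl_int length = (cl_int)n;

        clSetKernelArg(reduceKernel, 0, sizeof(cl_mem), &src);
        clSetKernelArg(reduceKernel, 1, sizeof(cl_float) * workGroupSize, NULL); /* local scratch */
        clSetKernelArg(reduceKernel, 2, sizeof(cl_int), &length);
        clSetKernelArg(reduceKernel, 3, sizeof(cl_mem), &dst);
        clEnqueueNDRangeKernel(queue, reduceKernel, 1, NULL,
                               &globalSize, &workGroupSize, 0, NULL, NULL);

        /* The partial sums of this pass become the input of the next pass. */
        cl_mem tmp = src; src = dst; dst = tmp;
        n = numGroups;
    }

    float sum = 0.0f;
    clEnqueueReadBuffer(queue, src, CL_TRUE, 0, sizeof(float), &sum, 0, NULL, NULL);
    return sum;
}

This would avoid the sequential CPU loop, but it is exactly this repeated/recursive scheme that I am unsure about.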

Maybe I didn't understand the second stage very well, especially this part of the kernel code from the "two-stage reduction" (at this link; I think this is the case of searching for the minimum of the input array):

__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {

  int global_index = get_global_id(0);
  float accumulator = INFINITY;
  // Loop sequentially over chunks of input vector
  while (global_index < length) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;
    global_index += get_global_size(0);
  }

  // Perform parallel reduction
  ...

Could someone explain what the above kernel code snippet does?

Is it related to the second stage of the reduction, i.e. the final summation?

Feel free to ask me for more details if my issue is unclear.

Thanks

  • I think this computation of the 640000 work groups is not right. I also had some difficulties understanding this article at first. [This answer](http://stackoverflow.com/a/10975985/3182664) helped a bit: It makes clear that there will be *one* result for each work group. Even a large number of elements is reduced into "few" elements, which can then be reduced on the host. – Marco13 May 07 '16 at 00:15
  • @Marco13 why do you think that the computation of 640000 work groups is not right? –  May 07 '16 at 01:19
  • I think that you can define an "arbitrary" work group size. But for 10240000 elements, you may, for example, use 64 work groups, which will compute 64 results, where each work group will take care of 10240000/64 elements (but admittedly, I'd have to refresh my memory here in order to make a more definite statement - sorry, that's why I only commented) – Marco13 May 07 '16 at 01:30
  • OK, I see what you mean, but in your example with "10240000/64 = size of each work group", I think I am over the maximum work group size supported by my GPU device (in my header cl.h, I have `#define CL_DEVICE_MAX_WORK_GROUP_SIZE 0x1004`) –  May 07 '16 at 01:38
  • The constant `CL_DEVICE_MAX_WORK_GROUP_SIZE` is **not** the maximum size of the work group. It is only the name of the parameter that can be queried with `clGetDeviceInfo` to obtain the maximum work group size: After `size_t theSize = 0; clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &theSize, NULL);`, the variable `theSize` will contain the maximum work group size. – Marco13 May 07 '16 at 18:28

1 Answer


As mentioned in the comment: The statement

if the input array size is 10240000 and the WorkGroup size is 16, we get 10000*2^10/2^4 = 10000*2^6 = 640000 WorkGroups.

is not correct. You can choose an "arbitrary" work group size, and an "arbitrary" number of work groups. The numbers to choose here may be tailored for the target device. For example, the device may have a certain local memory size. This can be queried with clGetDeviceInfo:

cl_ulong localMemSize = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, 
    sizeof(cl_ulong), &localMemSize, nullptr);

This may be used to compute the size of a local work group, considering the fact that each work group will require

sizeof(cl_float) * workGroupSize

bytes of local memory.

Similarly, the number of work groups may be derived from other device-specific parameters, for example the number of compute units.
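For example, here is a rough sketch of one possible policy (the clamping strategy and the "4 work groups per compute unit" factor are only illustrative choices, not fixed rules):

#include <CL/cl.h>

/* Sketch: derive a work group size and a number of work groups from
 * device properties; no error checking for brevity. */
static void chooseReductionSizes(cl_device_id device,
                                 size_t *workGroupSize, size_t *numWorkGroups)
{
    size_t maxWorkGroupSize = 0;
    cl_ulong localMemSize = 0;
    cl_uint computeUnits = 0;

    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
        sizeof(size_t), &maxWorkGroupSize, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
        sizeof(cl_ulong), &localMemSize, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
        sizeof(cl_uint), &computeUnits, NULL);

    /* Each work item needs one float of local memory for the scratch array. */
    size_t byLocalMem = (size_t)(localMemSize / sizeof(cl_float));
    size_t size = maxWorkGroupSize < byLocalMem ? maxWorkGroupSize : byLocalMem;

    /* Round down to a power of two, which the in-group reduction expects. */
    size_t p = 1;
    while (p * 2 <= size) p *= 2;

    *workGroupSize = p;
    *numWorkGroups = (size_t)computeUnits * 4; /* example heuristic */
}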


The key point regarding the reduction itself is that the work group size does not limit the size of the array that can be processed. I also had some difficulties with understanding the algorithm as a whole, so I tried to explain it here, hoping that a few images may be worth a thousand words:

[figure: reduction, 3 work groups of size 8 reducing an array of length 64]

As you can see, the number of work groups and the work group size are fixed and independent of the input array length: Even though I'm using 3 work groups with a size of 8 in the example (giving a global size of 24), an array of length 64 can be processed. This is mainly due to the first loop, which just walks through the input array, with a "step size" that is equal to the global work size (24 here). The result will be one accumulated value for each of the 24 threads. These are then reduced in parallel.
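
For completeness, here is a sketch of how the elided part of the kernel above ("Perform parallel reduction") typically continues, adapted to a sum instead of a minimum (it assumes a power-of-two work group size and a scratch array with one float per work item; take it as an illustration of the pattern rather than code to paste into your project):

__kernel
void reduceSum(__global const float* buffer,
               __local float* scratch,
               __const int length,
               __global float* result) {

  // Stage 1: grid-stride loop, one accumulated value per work item
  int global_index = get_global_id(0);
  float accumulator = 0.0f;               // 0 is the identity element for a sum
  while (global_index < length) {
    accumulator += buffer[global_index];
    global_index += get_global_size(0);
  }

  // Stage 2: parallel reduction inside the work group
  int local_index = get_local_id(0);
  scratch[local_index] = accumulator;
  barrier(CLK_LOCAL_MEM_FENCE);
  for (int offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
    if (local_index < offset) {
      scratch[local_index] += scratch[local_index + offset];
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  // One partial sum per work group
  if (local_index == 0) {
    result[get_group_id(0)] = scratch[0];
  }
}

Afterwards, result holds only as many values as there are work groups (e.g. 64), and these few partial sums can be added on the host with a trivial loop, or with one more launch of the same kernel over this small array.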

Marco13
  • Thanks, I realized that the first loop simplifies the problem, i.e. it goes from the case of an input array of size N to a `get_global_size` (the total number of threads) situation. Regards –  May 08 '16 at 05:41
  • Would it be possible to reduce the final `result` array on the device (GPU) instead of transferring it to the host (CPU)? – Kevin Apr 18 '21 at 15:10
  • @Kevin I don't see a reason why it should not be possible in general, but 1. I'd have to think about how this could be done, and 2. (more importantly) **if** this was "easy" (and would bring a good performance benefit), the AMD engineers would probably have done that. My gut feeling is that it would not bring a speedup, because it's memory-bound: A lot of memory has to be *read*, just to do the pairwise `min` operations, meaning that the GPU will be busy with reading/writing data most of the time. – Marco13 Apr 30 '21 at 23:55
  • @Marco13 Thanks, I have a case where the result array is very large, because I am only reducing certain pairs by reducing over specific axes of a multidimensional array. This means that the reduced array will not be a single scalar value, and transferring the array back and forth is quite expensive in my case; it also feels counterintuitive when I am doing multiple operations on the GPU. I cannot seem to find any sources regarding the most efficient way of reducing an array over certain axes in OpenCL. – Kevin May 01 '21 at 14:54
  • Something similar to how numpy `min` would work: https://numpy.org/doc/stable/reference/generated/numpy.amin.html – Kevin May 01 '21 at 14:54
  • @Kevin Sorry, I don't know an off-the-shelf-solution for that, and most likely will not have the time to dig deeper into that. The complexity that is added by multiple dimensions (and maybe arbitrary slicing-and-dicing, by reducing along arbitrary axes) is considerable, and raises questions about the "simple" solution here in general (e.g. the solution here is efficient due to memory coalescing, which may not be the case for a "strided" access). – Marco13 May 02 '21 at 11:56