This is a follow-up to my first question, explained at this link.
As a reminder, I would like to perform multiple sum reductions with OpenCL (my GPU device only supports OpenCL 1.2). I need to compute the sum reduction of an array to check the convergence criterion at each iteration of the main loop.
So far, I have written a version that performs only one sum reduction (i.e. one iteration). In this version, for simplicity, I use a sequential CPU loop to add up the partial sums and obtain the final value of the sum.
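A minimal sketch of what such a sequential CPU loop could look like, in plain host-side C. The names `final_sum_on_host`, `partial_sums` and `n_work_groups` are hypothetical and not taken from the actual version; the `double` accumulator is an assumption made here to limit rounding error.

```c
#include <assert.h>
#include <stddef.h>

/* Final stage on the host: add up the per-work-group partial sums
   read back from the device. All names here are illustrative. */
float final_sum_on_host(const float *partial_sums, size_t n_work_groups)
{
    assert(partial_sums != NULL || n_work_groups == 0);

    double total = 0.0;   /* wider accumulator: an assumption, not a requirement */
    for (size_t i = 0; i < n_work_groups; ++i)
        total += partial_sums[i];
    return (float)total;
}
```

This is cheap when there are few work-groups, but it scales linearly with their number, which is where the problem described next comes from.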
Following the advice given in my previous question, my issue is that I don't know how to perform the final sum by calling the NDRangeKernel function a second time (i.e. executing the kernel code a second time).
Indeed, with a second call, I will still face the same problem of summing the partial sums (themselves computed by the first call of NDRangeKernel): it seems to be a recursive issue.
Let's take an example from the above figure: if the input array size is 10240000 and the WorkGroup size is 16, we get 10000*2^10/2^4 = 10000*2^6 = 640000 WorkGroups.
So after the first call, I get 640000 partial sums: how do I deal with the final summation of all these partial sums? If I call the kernel code again with, for example, WorkGroup size = 16 and global size = 640000, I will get nWorkGroups = 640000/16 = 40000 partial sums, so I have to call the kernel code once more and repeat this process until nWorkGroups < WorkGroup size.
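To make the repeated-launch arithmetic concrete, here is a small host-side sketch in plain C (no OpenCL calls). It assumes each launch turns n values into ceil(n / WorkGroup size) partial sums, and that launches continue until a single value remains; `num_groups` and `count_passes` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups needed to cover n elements (ceiling division). */
size_t num_groups(size_t n, size_t wg_size)
{
    return (n + wg_size - 1) / wg_size;
}

/* Count the kernel launches needed until one value remains, assuming
   each launch reduces n values to ceil(n / wg_size) partial sums. */
unsigned count_passes(size_t n, size_t wg_size)
{
    assert(wg_size > 1);   /* otherwise the count would never shrink */

    unsigned passes = 0;
    while (n > 1) {
        n = num_groups(n, wg_size);
        ++passes;
    }
    return passes;
}
```

With the numbers above, the sequence is 10240000, 640000, 40000, 2500, 157, 10, 1, so `count_passes(10240000, 16)` gives 6 launches. The loop terminates for any WorkGroup size greater than 1, so the process is iterative rather than truly recursive.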
Maybe I have not fully understood the second stage, in particular this part of the kernel code from the "two-stage reduction" (at this link; I think this example searches for the minimum of the input array):
__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {
    int global_index = get_global_id(0);
    float accumulator = INFINITY;
    // Loop sequentially over chunks of input vector
    while (global_index < length) {
        float element = buffer[global_index];
        accumulator = (accumulator < element) ? accumulator : element;
        global_index += get_global_size(0);
    }
    // Perform parallel reduction
    ...
Could someone explain what the kernel code snippet above does?
Is it related to the second stage of the reduction, i.e. the final summation?
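One possible reading of the snippet, sketched as serial C rather than OpenCL: each work-item first walks the input with a stride of `get_global_size(0)`, so the number of partial results depends on the launch size and not on `length`. The sketch below swaps the minimum for a sum (accumulator starting at 0 instead of `INFINITY`) and replaces the elided parallel reduction with a plain loop; `work_item_partial_sum` and `simulate_launch` are hypothetical names, and `gid` and `global_size` stand in for `get_global_id(0)` and `get_global_size(0)`.

```c
#include <assert.h>
#include <stddef.h>

/* Stage 1 as seen by one work-item: a strided sequential sum,
   mirroring the while loop of the kernel snippet. */
float work_item_partial_sum(const float *buffer, size_t length,
                            size_t gid, size_t global_size)
{
    float accumulator = 0.0f;   /* identity for a sum (INFINITY for a min) */
    for (size_t i = gid; i < length; i += global_size)
        accumulator += buffer[i];
    return accumulator;
}

/* Serial emulation of one launch: writes one partial sum per work-group.
   result must hold global_size / wg_size floats. */
void simulate_launch(const float *buffer, size_t length,
                     size_t global_size, size_t wg_size, float *result)
{
    assert(wg_size > 0 && global_size % wg_size == 0);

    for (size_t group = 0; group < global_size / wg_size; ++group) {
        /* Stands in for the parallel reduction in local memory. */
        float group_sum = 0.0f;
        for (size_t lid = 0; lid < wg_size; ++lid)
            group_sum += work_item_partial_sum(buffer, length,
                                               group * wg_size + lid,
                                               global_size);
        result[group] = group_sum;
    }
}
```

Under this reading, launching with a small fixed global size (say 64 work-items in groups of 16) leaves only 4 partial sums whatever the input length, so the final summation can be one short CPU loop or one extra launch with a single work-group, rather than a long chain of launches.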
Feel free to ask for more details if my issue is unclear.
Thanks