From this previous post: strategy-for-doing-final-reduction, I would like to know the last functionalities offered by OpenCL 2.x (not 1.x which is the subject of this previous post above), especially about the atomic functions which allow to perform reductions of a array (in my case a sum reduction).
One told me that performances of OpenCL 1.x atomic functions (atom_add
) were bad and I could check it, so I am looking for a way to get the best performances for a final reduction function
(i.e the sum of each computed sum corresponding to each work-group).
I recall the typical kind of kernel code that I am using for the moment :
__kernel void sumGPU ( __global const double *input,
__global double *partialSums,
__local double *localSums)
{
uint local_id = get_local_id(0);
uint group_size = get_local_size(0);
// Copy from global memory to local memory
localSums[local_id] = input[get_global_id(0)];
// Loop for computing localSums
for (uint stride = group_size/2; stride>0; stride /=2)
{
// Waiting for each 2x2 addition into given workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Divide WorkGroup into 2 parts and add elements 2 by 2
// between local_id and local_id + stride
if (local_id < stride)
localSums[local_id] += localSums[local_id + stride];
}
// Write result into partialSums[nWorkGroups]
if (local_id == 0)
partialSums[get_group_id(0)] = localSums[0];
}
As you can see, at the end of kernel code execution, I get the array partialSums[number_of_workgroups]
containing all partial sums.
Could you tell me please how to perform a second and final reduction of this array, with the best performances possibles of functions availables with OpenCL 2.x . A classic solution is to perform this final reduction with CPU but ideally, I would like to do it directly with kernel code.
A suggestion of code snippet is welcome.
A last point, I am working on MacOS High Sierra 10.13.5 with the following model :
Can OpenCL 2.x be installed on my hardware MacOS model ?