While playing with OpenCL, I encountered a bug that I can't explain. Below is a reduction algorithm simply adapted to GPU like accelerators. You can see two versions of the reduction algorithm. V0 uses shared memory. V1 uses the work_group_reduce_<> feature of OpenCL 2.0.
V0 fails when I use a work group larger than 64. Note that the shared memory array contains 128 unsigned int
and I change this capacity to fit the size of the work group.
The V1 code works as expected for various work group sizes, at least powers of two and > 64.
V0 fails for work group sizes > 64. For example, the code below expects to run on work groups of 128 work items. Now, if I have the if(x_local_coordinate < i)
check before the shared memory reduction, the code works as expected.
I am interested to know why V0 does not work as expected when what seems to me to be a redundant check (if(x_local_coordinate < i)
) is not used and the kernel is run with a work group size > 64.
I have looked at similar implementation without understanding why the code below does not work. Maybe it comes from the host but then why does V0 always works.
Thank you.
typedef unsigned int Type;
kernel void ReduceBulk(global const Type* restrict buffer,
unsigned long buffer_size,
global Type* restrict reduction,
unsigned long sum_per_thread) {
const unsigned long x_coordinate = get_global_id(0);
const unsigned long x_dimension_size = get_global_size(0);
const unsigned long x_local_dimension_size = get_local_size(0);
const unsigned long x_local_coordinate = get_local_id(0);
Type sum = 0;
for(unsigned long i = 0; i < sum_per_thread; ++i) {
const unsigned long index = x_local_dimension_size * i + x_coordinate; // Coalesced
// Dont check if out of bound
// if(index >= buffer_size) {
// break;
// }
sum += buffer[index];
}
// V0, fails when WGs is > 64.
local Type a_scratch_buffer[128];
a_scratch_buffer[x_local_coordinate] = sum;
if(x_local_coordinate < (x_local_dimension_size / 2)) {
for(unsigned long i = x_local_dimension_size / 2; i != 0; i /= 2) {
barrier(CLK_LOCAL_MEM_FENCE);
// if(x_local_coordinate < i) {
// Without this additional check (redundant due to the one before the for loop) we get wrong result when
// using WGs > 64
a_scratch_buffer[x_local_coordinate] += a_scratch_buffer[x_local_coordinate + i];
// }
}
if(x_local_coordinate == 0) {
atomic_add(reduction, a_scratch_buffer[0]);
}
}
// // V1
// barrier(0);
// sum = work_group_reduce_add(sum);
// if(x_local_coordinate == 0) {
// atomic_add(reduction, sum);
// }
}