Parallel reduction sum on gpu computes wrong opencl

Question

I have written a parallel reduction sum on the GPU using global memory, because my GPU does not have shared memory (I believe this means that I can't use local memory?). The problem is that when I try to add more than 1024*4 numbers, it begins to output the wrong result. It is usually off by a few hundred to a few thousand, depending on how many numbers I input. What could the reason be? A is the input, C is the output.

    __kernel void GMM(__global float *A, __global float *B, __global float *C)
    {
        uint global_id = get_global_id(0);
        uint group_size = get_global_size(0);

        B[global_id] = A[global_id];
        for (int stride = group_size / 2; stride > 0; stride /= 2)
        {
            if (global_id < stride)
            {
                B[global_id] += B[global_id + stride];
            }
        }
        if (global_id == 0)
            C[get_group_id(0)] = B[0];
    }

How many workgroups are you scheduling and what is the workgroup size? — doqtor, Mar 29 '16 at 16:31
I have a global size of 1024*4 and a local size of 2 so 2048 workgroups. — Fluffy, Mar 29 '16 at 17:37
I guess I'm entering some sort of cap and end up adding random global values that have already been added when I exceed 1024*4 global size? — Fluffy, Mar 29 '16 at 17:54
Can you post both a working and a non-working version of the code, please, with the global and local sizes used for each? — huseyin tugrul buyukisik, Mar 29 '16 at 21:37
A reduction like this cannot work without synchronization. There is no guarantee as to the order in which the work-items will execute. Furthermore there is [no global sync](http://stackoverflow.com/questions/30209996/is-global-synchronization-in-opencl-possible) mechanism in OpenCL. So you will need to construct a reduction that is work-group aware. — Robert Crovella, Mar 30 '16 at 02:21
Your conclusion that your device has no local memory may be incorrect. I commented on your other question. — Robert Crovella, Mar 30 '16 at 02:32
I am not 100% sure that I do not have shared memory, but according to this page http://www.notebookcheck.com/NVIDIA-GeForce-GT-635M.66963.0.html, shared mem = no. However, I tried to create a local float inside my kernel (not as a kernel arg), and I was able to set a value to that float and print it out, so I'm a bit confused. — Fluffy, Mar 30 '16 at 07:12
The "shared mem" in that link is not the same as "shared memory" i.e. OpenCL local memory. Your GT635M **does** have OpenCL local memory. — Robert Crovella, Mar 30 '16 at 14:29

score 0 · Answer 1 · answered Mar 30 '16 at 09:44

Solved it: apparently I do have shared memory. By using __local memory and local barriers, the results are now consistent and correct!
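The answer does not include the corrected kernel, so the following is only a minimal sketch of what a work-group-aware reduction with `__local` memory and local barriers might look like. The kernel name `reduce_sum`, the `scratch` buffer, and the element count `n` are hypothetical and not taken from the original post; the sketch also assumes the local size is a power of two.

```c
// Sketch only: per-work-group sum reduction in local memory.
// Assumptions (not from the original post): the local size is a power of two,
// the host sizes `scratch` to local_size floats, and `n` is the input length.
__kernel void reduce_sum(__global const float *A,  // input values
                         __global float *C,        // one partial sum per work-group
                         __local float *scratch,   // local_size floats
                         const uint n)             // number of valid elements in A
{
    const uint gid = get_global_id(0);
    const uint lid = get_local_id(0);

    // Stage one element per work-item in local memory; pad with 0 past the end.
    scratch[lid] = (gid < n) ? A[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction inside the work-group. The barrier sits outside the
    // `if`, so every work-item in the group reaches it on each iteration.
    for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (lid < stride) {
            scratch[lid] += scratch[lid + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // The first work-item of each group writes that group's partial sum.
    if (lid == 0) {
        C[get_group_id(0)] = scratch[0];
    }
}
```

On the host, the `__local` argument would be sized with `clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL)`. The kernel leaves one partial sum per work-group in C, so the final total still has to be formed by summing C on the host or by running the kernel again on C. A larger local size than 2 (for example 64 or 128, within the device limit) keeps the number of partial sums small.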

Fluffy
