I have written a parallel reduction sum on the GPU using global memory, because I believe my GPU does not have shared memory (which I take to mean that I can't use local memory). The problem is that when I try to sum more than 1024*4 numbers, the output is wrong, usually off by a few hundred to a few thousand depending on how many numbers I input. What could the reason be? A is the input and C is the output.

    __kernel void GMM(__global float *A, __global float *B, __global float *C)
    {
        uint global_id = get_global_id(0);
        uint group_size = get_global_size(0);

        B[global_id] = A[global_id];
        for (int stride = group_size / 2; stride > 0; stride /= 2)
        {
            if (global_id < stride)
            {
                B[global_id] += B[global_id + stride];
            }
        }
        if (global_id == 0)
            C[get_group_id(0)] = B[0];
    }
Fluffy
  • How many workgroups are you scheduling and what is the workgroup size? – doqtor Mar 29 '16 at 16:31
  • I have a global size of 1024*4 and a local size of 2 so 2048 workgroups. – Fluffy Mar 29 '16 at 17:37
  • I guess I'm entering some sort of cap and end up adding random global values that have already been added when I exceed 1024*4 global size? – Fluffy Mar 29 '16 at 17:54
  • Can you post a working and a non-working version of the code, please, with the global and local sizes? – huseyin tugrul buyukisik Mar 29 '16 at 21:37
  • A reduction like this cannot work without synchronization. There is no guarantee as to the order in which the work-items will execute. Furthermore there is [no global sync](http://stackoverflow.com/questions/30209996/is-global-synchronization-in-opencl-possible) mechanism in OpenCL. So you will need to construct a reduction that is work-group aware. – Robert Crovella Mar 30 '16 at 02:21
  • Your conclusion that your device has no local memory may be incorrect. I commented on your other question. – Robert Crovella Mar 30 '16 at 02:32
  • I am not 100% sure that I do not have shared memory, but according to this page http://www.notebookcheck.com/NVIDIA-GeForce-GT-635M.66963.0.html, shared mem = no. However, I tried to create a local float inside my kernel (not as a kernel arg), and I was able to assign a value to that float and print it out, so I'm a bit confused. – Fluffy Mar 30 '16 at 07:12
  • The "shared mem" in that link is not the same as "shared memory" i.e. OpenCL local memory. Your GT635M **does** have OpenCL local memory. – Robert Crovella Mar 30 '16 at 14:29
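
The point about execution order in the comments can be illustrated on the CPU. The following is only a sketch in plain C, not OpenCL: `run_work_item` copies the body of the kernel from the question, while `simulate`, `N`, and the zero-initialised `B` are assumptions made for the illustration. It runs every work-item from start to finish, one after another, which is one of many schedules a device is allowed to pick when there is no synchronization.

```c
#include <string.h>

#define N 8  /* hypothetical global size (power of two), chosen for the demo */

/* Body of the kernel from the question, executed for a single work-item. */
static void run_work_item(unsigned gid, const float *A, float *B)
{
    B[gid] = A[gid];
    for (unsigned stride = N / 2; stride > 0; stride /= 2) {
        if (gid < stride)
            B[gid] += B[gid + stride];  /* may read a value not yet written */
    }
}

/* Run all work-items sequentially, in ascending or descending id order,
 * and return B[0], the value the kernel would store in C. */
static float simulate(const float *A, int ascending)
{
    float B[N];
    memset(B, 0, sizeof B);  /* assumption: B starts out zeroed */
    for (unsigned k = 0; k < N; ++k)
        run_work_item(ascending ? k : N - 1 - k, A, B);
    return B[0];
}
```

For the input 1, 2, ..., 8 (true sum 36), `simulate(A, 1)` returns 1 because work-item 0 finishes before any other element of `B` has been written, while `simulate(A, 0)` happens to return 36. Neither order is guaranteed on a real device, so the result depends on scheduling.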

1 Answer

Solved it: apparently I do have shared memory. By using __local memory and local barriers, the results are now consistent and correct!
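
The actual kernel is not shown, so the following is only a sketch of the general shape of such a work-group-aware reduction, written as a plain C emulation rather than OpenCL. The names `reduce_group`, `reduce_all`, and `LOCAL_SIZE` are hypothetical. The array `scratch` stands in for a `__local` buffer, and the comments mark where a kernel would call `barrier(CLK_LOCAL_MEM_FENCE)`. It assumes the work-group size is a power of two and the input length is a multiple of it.

```c
#include <stddef.h>

#define LOCAL_SIZE 4  /* hypothetical work-group size, must be a power of two */

/* Emulates one work-group: reduces LOCAL_SIZE inputs to one partial sum. */
static float reduce_group(const float *in)
{
    float scratch[LOCAL_SIZE];  /* stands in for a __local float buffer */

    /* Each work-item copies its element from global to local memory. */
    for (size_t lid = 0; lid < LOCAL_SIZE; ++lid)
        scratch[lid] = in[lid];
    /* kernel: barrier(CLK_LOCAL_MEM_FENCE) so all copies are visible */

    for (size_t stride = LOCAL_SIZE / 2; stride > 0; stride /= 2) {
        /* Work-items with lid < stride each add one partner element. */
        for (size_t lid = 0; lid < stride; ++lid)
            scratch[lid] += scratch[lid + stride];
        /* kernel: barrier(CLK_LOCAL_MEM_FENCE) before the next stride */
    }
    return scratch[0];  /* kernel: work-item 0 writes this to C[group id] */
}

/* Emulates the whole launch: one partial sum per work-group, then a final
 * pass over the partials. n must be a multiple of LOCAL_SIZE. */
static float reduce_all(const float *A, size_t n)
{
    float total = 0.0f;
    for (size_t group = 0; group < n / LOCAL_SIZE; ++group)
        total += reduce_group(A + group * LOCAL_SIZE);
    return total;
}
```

Each stride step only reads elements that the previous step finished writing, which is what the barriers enforce within a work-group. Because barriers do not synchronize across work-groups, each group produces its own partial sum, and the partials still have to be added together, for example on the host or in a second kernel launch.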

Fluffy