0

I have an opencl kernel that finds the maximum ASCII character in a string. The problem is I cannot synchronize the multiple read-writes to global and local memories. I am trying to update a local_maximum character in shared memory, and at the end of the workgroup (last thread), the global_maximum character, by comparing it with the local_maximum. The threads are writing one over another, I guess.

eg: Input string: "pirates of the carribean".

Output String: 'r' (but it should be 's').

Please have a look at the code and give a solution as to what I can do to get everything synchronized. I am sure people having sound knowledge can understand the code. Optimization tips are welcome.

The code is below:

__kernel void find_highest_ascii( __global const char* data, __global char* result, unsigned int size,  __local char* localMaxC )
{
//creating variables and initialising..
unsigned int i, localSize, globalSize, j;
char privateMaxC,temp,temp1;

i = get_global_id(0);
localSize = get_local_size(0);
globalSize = get_global_size(0);

privateMaxC = '\0';

if(i<size){
if(i == 0)
read_mem_fence( CLK_LOCAL_MEM_FENCE );
*localMaxC = '\0';
mem_fence( CLK_LOCAL_MEM_FENCE);

////////////////////////////////////////////////////
/////UPDATING PRIVATE MAX CHARACTER/////////////////
////////////////////////////////////////////////////

for( j = i; j<size; j+=globalSize )
{
    if( data[j] > privateMaxC )
    {
        privateMaxC = data[j];
    }
}

///////////////////////////////////////////////////


///////////////////////////////////////////////////
////UPDATING SHARED MAX CHARACTER//////////////////
///////////////////////////////////////////////////

temp = *localMaxC;
read_mem_fence( CLK_LOCAL_MEM_FENCE );

if(privateMaxC>temp)
{
    *localMaxC = privateMaxC;
    write_mem_fence( CLK_LOCAL_MEM_FENCE );
    temp = privateMaxC;
}

//////////////////////////////////////////////////


//UPDATING GLOBAL MAX CHARACTER.

temp1 = *result;

if(( (i+1)%localSize == 0 || i==size-1) && (temp > temp1 ))
{
            read_mem_fence( CLK_GLOBAL_MEM_FENCE );
    *result = temp;
    write_mem_fence( CLK_GLOBAL_MEM_FENCE );
}


 }
}
mb1994
  • 241
  • 3
  • 13

1 Answers1

1

You are correct that threads will be overwriting each other's values, since your code is riddled with race conditions. In OpenCL, there is no way to synchronise between work-items that are in different work-groups. Instead of trying to achieve this kind of synchronisation with explicit fences, you can make your code much simpler by using the built-in atomic functions instead. In particular, there is an atomic_max built-in which solves your problem perfectly.

So, instead of the code you currently have to update both your local and global memory maximum values, just do something like this:

kernel void ascii_max(global int *input, global int *output, int size,
                      local int *localMax)
{
  int i = get_global_id(0);
  int l = get_local_id(0);

  // Private reduction                                                          
  int privateMax = '\0';
  for (int idx = i; idx < size; idx+=get_global_size(0))
  {
    privateMax = max(privateMax, input[idx]);
  }

  // Local reduction                                                            
  atomic_max(localMax, privateMax);
  barrier(CLK_LOCAL_MEM_FENCE);

  // Global reduction                                                           
  if (l == 0)
  {
    atomic_max(output, *localMax);
  }
}

This will require you to update your local memory scratch space and final result to use 32-bit integer values, but on the whole is a significantly cleaner approach to solving this problem (not to mention it actually works).


NON-ATOMIC SOLUTION

If you really don't want to use atomics, then you can implement a bog-standard reduction using local memory and work-group barriers. Here's an example:

kernel void ascii_max(global int *input, global int *output, int size,
                      local int *localMax)
{
  int i = get_global_id(0);
  int l = get_local_id(0);

  // Private reduction                                                          
  int privateMax = '\0';
  for (int idx = i; idx < size; idx+=get_global_size(0))
  {
    privateMax = max(privateMax, input[idx]);
  }

  // Local reduction                                                            
  localMax[l] = privateMax;
  for (int offset = get_local_size(0)/2; offset > 1; offset>>=1)
  {
    barrier(CLK_LOCAL_MEM_FENCE);
    if (l < offset)
    {
      localMax[l] = max(localMax[l], localMax[l+offset]);
    }
  }

  // Store work-group result in global memory                                   
  if (l == 0)
  {
    output[get_group_id(0)] = max(localMax[0], localMax[1]);
  }
}

This compares pairs of elements at a time using local memory as a scratch space. Each work-group will produce a single result, which is stored in global memory. If your data-set is small, you could run this with a single work-group (i.e. make global and local sizes the same), and this will work just fine. If it is larger, you could run a two-stage reduction by running this kernel twice, e.g.:

size_t N = ...; // something big

size_t local  = 128;
size_t global = local*local; // Must result in at most 'local' number of work-groups

// First pass - run many work-groups using temporary buffer as output
clSetKernelArg(kernel, 1, sizeof(cl_mem), d_temp);
clEnqueueNDRangeKernel(..., &global, &local, ...);

// Second pass - run one work-group with temporary buffer as input
global = local;
clSetKernelArg(kernel, 0, sizeof(cl_mem), d_temp);
clSetKernelArg(kernel, 1, sizeof(cl_mem), d_output);
clEnqueueNDRangeKernel(..., &global, &local, ...);

I'll leave it to you to run them and decide which approach would be best for your own data-set.

jprice
  • 9,755
  • 1
  • 28
  • 32
  • Thanks, worked. :) I had a few doubts in what you've said though. 1.why are you updating the global maximum with the first thread? the local maximum isnt even updated yet. should be done when (get_local_id(0) == get_local_size(0)-1), isnt it? 2.Do you have any tips on improving the algorithm here? Also, atomic writes are slow as far as I know. I'm sure you've understood the algorithm. Please reply with any tips. :) – mb1994 Mar 17 '14 at 12:54
  • There's a work-group barrier with a local memory fence, so all of the local atomic operations within that work-group will have finished by the time the global atomic update happens. Atomic operations can often be a lot faster than the alternative, since lots of hardware has native support for atomics. If you aren't convinced, you could try implementing a local memory reduction which compares pairs of values at a time to produce a single maximum for each work-group, with a second stage to reduce each work-group result into a single result (search for "OpenCL parallel reduction for examples"). – jprice Mar 17 '14 at 13:07
  • This answer is not only simpler but is the way to go. Private->Local->Global. Should not have big performance penalties as long as the Private and Local sizes are big enough to hide the atomic overhead. Performance wise, this is the way to go as well. – DarkZeros Mar 17 '14 at 13:44
  • Another thing to note: Any barriers *must* be encountered by every work item in a work group. You have one under a conditional, which could cause a hang in your kernel or at least incorrect results. – Dithermaster Mar 17 '14 at 18:27
  • are you talking about the one when i==0? i forgot to insert the brackets there. those 3 statements are under the if. Another question: are we bound to use atomics here? is there any other way of solving this problem of synchronization? – mb1994 Mar 18 '14 at 14:46
  • You don't *have* to use atomics, but it's by far the best solution for this. If you really, really, *really* don't want to use atomics, then you'll need to perform a multi-stage reduction: one kernel to reduce values in local memory with one result per work-group, and a second kernel to reduce each of these results into a single value in global memory. If you only ever have a small set of input data, you could potentially just have a one stage reduction that only uses a single work-group, but this won't utilise the whole device. – jprice Mar 18 '14 at 15:02
  • @jprice: had thought of this algorithm as well. But here again, we need to find "one" local maximum, so it would require atomic operations here as well. Basically, we will __have__ to use atomic operations somewhere or the other, isnt it? is there no other alternative that opencl provides? I've been searching all over the internet, but with little output. – mb1994 Mar 18 '14 at 17:46
  • @user3139310 No, you can use work-group barriers to synchronise in local memory. It's *global memory* that you can't synchronise. I've added a full example of a non-atomic reduction to my answer, which I've tested on my own machine. These things are very standard; if you search for "OpenCL parallel reduction" you'll get a whole bunch of example that explain many ways of doing this. – jprice Mar 18 '14 at 21:05
  • You gave me a beautiful insight into the problem. Thanks a lot for that. :-) – mb1994 Mar 19 '14 at 10:08