OpenCL and GPU global synchronization

Question

Has anyone tried the gpu_sync functions described in the article "Inter-Block GPU Communication via Fast Barrier Synchronization"? All the codes described seems pretty simple and easy to implement but it keeps freezing up my GPU. I'm sure I'm doing something stupid but I can't see what. Can anyone help me?

The strategy I'm using is the one described in the section “GPU Lock-Free Synchronization” and here is the OpenCL source code I've implemented:

static void globalSync(uint iGoalValue,
                   volatile __global int *globalSyncFlagsIN,
                   volatile __global int *globalSyncFlagsOUT)
{
 const size_t iLocalThreadID  = get_local_id(0);
 const size_t iWorkGroupID    = get_group_id(0);
 const size_t iWorkGroupCount = get_num_groups(0);

 //Only the first thread on each SM is used for synchronization
 if (iLocalThreadID == 0)
 { globalSyncFlagsIN[iWorkGroupID] = iGoalValue; }

 if (iWorkGroupID == 0)
 {
  if (iLocalThreadID < iWorkGroupCount)
  {
   while (globalSyncFlagsIN[iLocalThreadID] != iGoalValue) {
    // Nothing to do here
   }
  }

  barrier(CLK_GLOBAL_MEM_FENCE);

  if (iLocalThreadID < iWorkGroupCount)
  { globalSyncFlagsOUT[iLocalThreadID] = iGoalValue; }
 }

 if (iLocalThreadID == 0)
 {
  while (globalSyncFlagsOUT[iWorkGroupID] != iGoalValue) {
   // Nothing to do here 
  }
 }

 barrier(CLK_GLOBAL_MEM_FENCE);
}

Thanks in advance.

Sergei Kulik · Answer 1 · 2015-12-27T06:24:23.943

2

I haven't tried running the code, but the direct translation from CUDA to OpenCL of the code from the article mentioned above would be:

{  
    int tid_in_blk = get_local_id(0) * get_local_size(1)
        + get_local_id(1);
    int nBlockNum = get_num_groups(0) * get_num_groups(1);
    int bid = get_group_id(0) * get_num_groups(1) + get_group_id(1);


    if (tid_in_blk == 0) {
        Arrayin[bid] = goalVal;
    }

    if (bid == 1) {
        if (tid_in_blk < nBlockNum) {
            while (Arrayin[tid_in_blk] != goalVal){

            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        if (tid_in_blk < nBlockNum) {
            Arrayout[tid_in_blk] = goalVal;
        }
    }

    if (tid_in_blk == 0) {
        while (Arrayout[bid] != goalVal) {

        }
    }
}

Please note the difference in thread and group IDs and in using local memory barrier instead of global one.

edited Dec 27 '15 at 06:24

answered Dec 27 '15 at 06:12

Sergei Kulik

344
2
9

Just having a second thought regarding my last sentence. Arrayin and Arrayout should be in __global memory, of course, so that threads from different blocks could communicate through it. The memory fence should still be local as only threads from group 1 are synchronized. I edited the answer. – Sergei Kulik Dec 27 '15 at 06:23
Also, make sure that the lengths of the arrays are at least equal to the number of blocks. – Sergei Kulik Dec 27 '15 at 06:26
My doubt is about the last sync barrier described in the article. I've managed to make the algorithm work by removing it from the code, but why the authors suggest its use? It's really necessary? – Walid Dec 30 '15 at 12:20
Somehow I've missed that last __synchthreads() in the article. Or perhaps I've been looking through a different version of the article. As far as I can tell, the code (I mean the global synchronization) MUST work as expected even without explicit inter-group synchronization at the end. The authors might have added it to align the threads inside groups before returning from the function which may be a good thing. – Sergei Kulik Jan 05 '16 at 05:21
Funny thing is: the algorithm runs perfectly on my 5870m but freezes my R9 290X. Exactly the same code and two different behaviors. Do you have any idea about what is going on? – Walid Jan 07 '16 at 23:40

score 0 · Answer 2 · answered May 17 '23 at 08:05

Must be too late but just for the reference. Unfortunately this is not going to work because barrier() only works across the work items in the same workgroup. i.e., user can only specify the address_space but not memory_scope. The builtin has been renamed as work_group_barrier to avoid that confusion. (barrier() is still supported for the backward compatibility) https://registry.khronos.org/OpenCL/sdk/2.0/docs/man/xhtml/work_group_barrier.html atomic extensions might be helpful, such as atomic_inc on __global.

OpenCL and GPU global synchronization

2 Answers2

Linked