Is global synchronization in OpenCL possible?

Question

As is well known, the OpenCL barrier() function works only within a single work-group, and there is no direct way to synchronize across work-groups. If it is possible, what is the best approach to global synchronization today? Using atomics, OpenCL 2.0 features, etc.?

Github links, examples are welcome!

Thanks!

As said above: it is not possible. You can always find problems that fit into one work-group and have at least local synchronization. But of course, if the problem size grows, this won't work anymore, and it won't run on different hardware without adjusting your problem size. OpenCL 2.0 offers device-side enqueue (kernels enqueueing other kernels). This might reduce some overhead if you need host-based synchronization, but it is not a general solution to all kinds of problems. - Christian, May 13 '15 at 09:51
You can try to divide your program into several kernels and synchronize them through the command queue. This is especially effective when values do not have to be carried over between kernels because they can be recomputed. If you do need to keep values, you can use global arrays of structs/vectors to pass them between kernels. - blind.wolf, May 14 '15 at 12:24
Another thing is that there is no example, or at least a description, of the algorithm you are trying to parallelize, so it's hard to tell what to do. I recommend looking at your data access pattern again, because synchronization on this scale is suspicious. Your problem might have another solution, or it may be better to use something other than OpenCL, for example SSE, multithreading, OpenMP, ... - blind.wolf, May 14 '15 at 12:35

score 5 · Answer 1 · answered May 17 '15 at 17:09

Global synchronization within a kernel is not possible. This is because work-groups are not guaranteed to run at the same time. You can achieve a sort of global sync from the host application if you break your kernel into pieces. This is not suitable for many kernels, especially if you use a lot of local memory (its contents do not survive between kernel launches) or have a bit of initialization code before your kernel does any real work.

Break your kernel into two parts, kernelA and kernelB for example. Global synchronization is then simply a matter of enqueueing the NDRange for kernelA, calling finish(), and enqueueing the NDRange for kernelB. The global data will remain in memory between the two calls.

Again, not pretty and not necessarily high performance, but if you really must have global sync, this is the only way to get it.
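To make the reasoning concrete, here is a minimal sketch in plain C that models the idea on the CPU. It is not OpenCL API code: `launch`, `kernel_a`, `kernel_b`, `kernel_fused` and the sizes are hypothetical names chosen for illustration. The model runs work-groups one after another, which is a schedule OpenCL is allowed to pick, and shows that a phase needing every group's result only gets correct input when it is a separate launch.

```c
#include <stddef.h>

#define NUM_GROUPS 4   /* hypothetical sizes, chosen only for the illustration */
#define GROUP_SIZE 2

/* In this model a "kernel" is a function that is run once per work-group. */
typedef void (*group_fn)(size_t group, const int *in, int *partial, int *out);

/* Model of one NDRange launch. Groups run strictly one after another here;
   OpenCL permits this, since it never promises that groups run concurrently. */
static void launch(group_fn k, const int *in, int *partial, int *out) {
    for (size_t g = 0; g < NUM_GROUPS; ++g)
        k(g, in, partial, out);
}

/* Phase A: each group reduces its own slice into partial[group]. */
static void kernel_a(size_t g, const int *in, int *partial, int *out) {
    (void)out;
    int sum = 0;
    for (size_t i = 0; i < GROUP_SIZE; ++i)
        sum += in[g * GROUP_SIZE + i];
    partial[g] = sum;
}

/* Phase B: each group needs ALL partial sums, so every group of phase A
   must have finished before any group of phase B starts. */
static void kernel_b(size_t g, const int *in, int *partial, int *out) {
    (void)in;
    int total = 0;
    for (size_t j = 0; j < NUM_GROUPS; ++j)
        total += partial[j];
    out[g] = total;
}

/* Both phases inside one launch, with nothing separating them globally. */
static void kernel_fused(size_t g, const int *in, int *partial, int *out) {
    kernel_a(g, in, partial, out);
    kernel_b(g, in, partial, out);
}

/* Two launches: the boundary between them acts as the global sync point. */
void run_split(const int *in, int *partial, int *out) {
    launch(kernel_a, in, partial, out);
    launch(kernel_b, in, partial, out);
}

/* One launch: early groups read partial sums that were never written. */
void run_fused(const int *in, int *partial, int *out) {
    launch(kernel_fused, in, partial, out);
}
```

With the input 1..8 and `partial` zeroed beforehand, `run_split` gives every group the full total, while `run_fused` gives the early groups incomplete totals. In real host code the two `launch` calls would correspond to the two NDRange enqueues described above, with `partial` living in a global buffer.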

score 4 · Answer 2 · edited May 23 '17 at 12:33

While global synchronization has no succinct in-kernel API call, if the compute device supports the OpenCL extension cl_khr_global_int32_base_atomics, it may be implemented using atomics.

Please see Xiao et al.'s paper that evaluates lock and lock-free approaches to global synchronization on GPUs. http://synergy.cs.vt.edu/pubs/papers/xiao-ipdps2010-gpusync.pdf

This is mentioned in another Stack Overflow post: OpenCL and GPU global synchronization
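For readers who want to see the shape of a counter-based barrier built on atomics, here is a minimal CPU analogue using C11 `<stdatomic.h>`. It is a sketch under assumptions, not code from the paper or from any OpenCL implementation: `global_barrier`, `arrived` and `goal` are hypothetical names. In a kernel, one work-item per work-group would do the same thing with an atomic increment on a `__global` counter, followed by a local barrier().

```c
#include <stdatomic.h>

/* One participant (in the GPU setting: one work-item per work-group)
   announces its arrival and then spins until `goal` participants have
   arrived. Assumption: `arrived` starts at 0 and `goal` is the number of
   participants; to reuse the barrier, raise `goal` by that number each time. */
void global_barrier(atomic_int *arrived, int goal) {
    atomic_fetch_add(arrived, 1);            /* announce arrival */
    while (atomic_load(arrived) < goal) {
        /* busy-wait: this loop never ends if the remaining participants
           cannot run while this one is spinning */
    }
}
```

The busy-wait is the weak point. It only terminates if every participating work-group is resident on the device at the same time, which, as the other answer notes, OpenCL does not guarantee. In practice that means launching no more work-groups than the device can run concurrently.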

score 0 · Answer 3 · answered Aug 04 '20 at 12:25

If a command_queue is configured for in-order processing, global synchronisation can be achieved through the ordering of sequential kernels. There is no explicit barrier() call, just kernel1 enqueued before kernel2: on an in-order queue, kernel1 will complete all of its work before kernel2 starts. You will need a buffer shared between the two kernels to pass information between them.

In-order processing is the default. There is no need to call finish() between kernels.

If out-of-order execution is required, the command queue can be configured for it by passing CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE in the properties given to clCreateCommandQueueWithProperties. In that case finish() would be required to ensure synchronisation.
