Are OpenCL workgroups executed simultaneously?

Question

My understanding was that each workgroup is executed on the GPU to completion, and only then is the next one executed.

Unfortunately, my observations lead to the conclusion that this is not correct. In my implementation, all workgroups share a big global memory buffer. All workgroups perform read and write operations to various positions on this buffer.

If the kernel operates on the buffer directly, no conflicts arise. If a workgroup loads a chunk into local memory, performs some computation and copies the result back, the global memory gets corrupted by other workgroups.

So how can I avoid this behaviour?

Can I somehow tell OpenCL to execute only one workgroup at a time, or rearrange the execution order, so that I don't get conflicts?

score 1 · Accepted Answer · answered Aug 18 '15 at 15:48

The answer is that it depends. All work-items of a single work-group must be executed concurrently (though not necessarily in parallel) on the device, at least when barriers are present, because the work-group must be able to synchronize and communicate. There is no rule that says different work-groups must run concurrently with each other - but there is no rule that says they cannot. Usually hardware will place a single work-group on a single compute core. Most hardware has multiple cores, which will each get a work-group, and to cover latency a lot of hardware will also place multiple work-groups on a single core if there is capacity available.
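To make the corruption from the question concrete, here is a minimal host-side simulation in plain C (not OpenCL code). It assumes two work-groups whose chunks overlap, which the question does not state explicitly, and all names (`group_state`, `load_chunk`, `run_interleaved`, and so on) are hypothetical. Each "group" copies its chunk to a private array standing in for local memory, adds 1 to every element, and copies it back.

```c
#include <assert.h>
#include <string.h>

#define CHUNK 4

/* Hypothetical stand-in for one work-group's private copy of a chunk. */
typedef struct {
    int local[CHUNK];   /* plays the role of __local memory */
    size_t base;        /* first global index covered by this group */
} group_state;

/* Phase 1: copy a chunk of the shared buffer into the group's local copy. */
void load_chunk(group_state *g, const int *global, size_t base) {
    g->base = base;
    memcpy(g->local, global + base, sizeof g->local);
}

/* Phase 2: compute on the local copy only (here: add 1 to every element). */
void compute(group_state *g) {
    for (size_t i = 0; i < CHUNK; ++i) g->local[i] += 1;
}

/* Phase 3: copy the local result back, overwriting whatever is there now. */
void store_chunk(const group_state *g, int *global) {
    assert(g->base + CHUNK >= g->base);  /* guard against index overflow */
    memcpy(global + g->base, g->local, sizeof g->local);
}

/* Interleaved schedule: both groups load before either one stores. */
void run_interleaved(int *global) {
    group_state a, b;
    load_chunk(&a, global, 0);  /* group A covers indices 0..3 */
    load_chunk(&b, global, 2);  /* group B covers indices 2..5 */
    compute(&a);
    compute(&b);
    store_chunk(&a, global);
    store_chunk(&b, global);    /* overwrites A's update of indices 2 and 3 */
}

/* Serialized schedule: group A finishes completely before group B starts. */
void run_serialized(int *global) {
    group_state a, b;
    load_chunk(&a, global, 0); compute(&a); store_chunk(&a, global);
    load_chunk(&b, global, 2); compute(&b); store_chunk(&b, global);
}
```

Starting from a zeroed buffer of six elements, the serialized schedule increments the shared indices 2 and 3 twice, while the interleaved schedule increments them only once: group B works on a stale copy and its write-back discards group A's result. On a real device the interleaving is up to the scheduler, so the outcome is not predictable.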

You have no way to control the order in which work-groups execute. If you want them to serialize you would be better off launching just one work-group and writing a loop inside to serialize the series of work chunks in that same work-group. This is often a good strategy in general even with multiple work-groups.

If you really only want one work-group at a time, though, you will probably be using only a tiny part of the hardware. Most hardware cannot spread a single work-group across the entire device - so if you're stuck on one core of a 32-core GPU you're not getting much use out of the device.

score 1 · Answer 2 · edited May 23 '17 at 11:52

You need to set the global size to that of a single work-group and enqueue a new NDRange for each group, essentially breaking up the call to your kernel into many smaller calls. Make sure your command queue does not allow out-of-order execution, so that each kernel launch completes before the next one starts.

This will likely result in poorer performance, but you will get the dedicated global memory access you are looking for.
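As a rough sketch of how the per-group launches could be laid out for a 1-D range, the plain C helper below computes the offset and sizes for each launch. The struct `launch_1d` and the function `plan_single_group_launches` are hypothetical names, and the helper assumes the total number of work-items is a multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* One entry per kernel launch; a hypothetical helper type, not an OpenCL type. */
typedef struct {
    size_t global_offset;  /* value for the global_work_offset argument */
    size_t global_size;    /* value for global_work_size (exactly one group) */
    size_t local_size;     /* value for local_work_size */
} launch_1d;

/* Split a 1-D range of total_items into launches of one work-group each.
   Fills at most max_launches entries of out and returns the number of
   launches the full range requires.
   Assumption: total_items is a multiple of local_size. */
size_t plan_single_group_launches(size_t total_items, size_t local_size,
                                  launch_1d *out, size_t max_launches) {
    assert(local_size > 0);
    assert(total_items % local_size == 0);
    size_t n = total_items / local_size;
    for (size_t i = 0; i < n && i < max_launches; ++i) {
        out[i].global_offset = i * local_size;
        out[i].global_size   = local_size;
        out[i].local_size    = local_size;
    }
    return n;
}

/* Each entry e would then be enqueued on an in-order queue, roughly as:
     clEnqueueNDRangeKernel(queue, kernel, 1, &e.global_offset,
                            &e.global_size, &e.local_size, 0, NULL, NULL);
   where queue and kernel are assumed to have been created elsewhere. */
```

Two things to keep in mind with this approach: a non-NULL `global_work_offset` requires OpenCL 1.1 or later, and inside each launch `get_global_id(0)` includes the offset while `get_group_id(0)` is always 0, so a kernel that indexes by group ID would need the offset passed in some other way.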

Yes, the groups can be executed in parallel; this is normally a very good thing. Here is a related question.

score 0 · Answer 3 · answered Sep 01 '15 at 00:37

The number of workgroups that can be concurrently launched on a ComputeUnit (AMD) or SMX (Nvidia) depends on the availability of GPU hardware resources, the important ones being vector registers and workgroup-level memory** (called LDS on AMD and shared memory on Nvidia). If you want to launch just one workgroup on a CU/SMX, make sure that the workgroup consumes the bulk of these resources and thereby blocks further workgroups from being placed on the same CU/SMX. You would, however, still have other workgroups executing on other CUs/SMXs - a GPU normally has several of these. I am not aware of any API which lets you pin a kernel to a single CU/SMX.

** It also depends on the number of concurrent wavefronts/warps the scheduler can handle.
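The resource argument above can be turned into a back-of-the-envelope estimate. The plain C sketch below uses a deliberately simplified model: it divides each per-CU limit by the per-workgroup usage and takes the minimum. The types `cu_limits` and `group_usage` and the function `groups_per_cu` are hypothetical, and real hardware adds allocation granularity and other limits that this ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Per-CU/SMX capacities (hypothetical struct; values come from vendor docs). */
typedef struct {
    size_t local_mem_bytes;  /* LDS / shared memory per CU */
    size_t vector_regs;      /* vector registers per CU */
    size_t max_waves;        /* wavefronts/warps the scheduler can track */
} cu_limits;

/* What one workgroup of a given kernel consumes (hypothetical struct). */
typedef struct {
    size_t local_mem_bytes;  /* local memory used by the whole group */
    size_t vector_regs;      /* registers used by the whole group */
    size_t waves;            /* wavefronts/warps per group */
} group_usage;

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Simplified upper bound on workgroups resident on one CU at the same time. */
size_t groups_per_cu(cu_limits cu, group_usage g) {
    assert(g.waves > 0);
    size_t n = cu.max_waves / g.waves;
    if (g.local_mem_bytes > 0)
        n = min_sz(n, cu.local_mem_bytes / g.local_mem_bytes);
    if (g.vector_regs > 0)
        n = min_sz(n, cu.vector_regs / g.vector_regs);
    return n;
}
```

With made-up limits of 64 KiB local memory, 65536 registers and 40 wavefronts per CU, a workgroup using 8 KiB of local memory would fit eight times, while one using 40 KiB would fit only once, which is the blocking effect described above. The actual local memory footprint of a kernel can be queried with `clGetKernelWorkGroupInfo` and `CL_KERNEL_LOCAL_MEM_SIZE`, and the device total with `clGetDeviceInfo` and `CL_DEVICE_LOCAL_MEM_SIZE`.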
