In OpenCL 2.1, a work group is subdivided into subgroups. work_group_barrier()
synchronizes all the work items in a work group, whereas sub_group_barrier()
synchronizes only the work items in one subgroup.
Is it possible to synchronize the work items in a range of subgroups?
For example, a work group consists of 5 subgroups, each containing 64 work items. Subgroups 0 and 1 (work items 0 to 127) should synchronize, so that after the barrier, work items from subgroup 0 can access data written by subgroup 1. At the same time, subgroups 2, 3 and 4 could continue without participating in this synchronization, possibly executing a different part of the code.
In CUDA this is possible for warps (the equivalent of a subgroup, 32 threads) using inline PTX assembly; see the question "CUDA: how to use barrier.sync".
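To make the CUDA side concrete, here is a minimal sketch of what I mean, scaled to CUDA's 32-thread warps: a block of 5 warps in which only warps 0 and 1 (threads 0 to 63) meet at a named barrier, while warps 2 to 4 skip it. The helper name named_barrier_sync, the kernel name partial_sync and the choice of barrier ID 1 are my own assumptions for illustration. The only real mechanism used is the PTX instruction bar.sync with a barrier ID and a thread count.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wait on named barrier `barrier_id` until `num_threads`
// threads of the block have arrived. `num_threads` must be a multiple of the
// warp size (32). Barrier 0 is the one __syncthreads() uses, so use 1..15 here.
__device__ __forceinline__ void named_barrier_sync(unsigned barrier_id,
                                                   unsigned num_threads)
{
    asm volatile("bar.sync %0, %1;"
                 :
                 : "r"(barrier_id), "r"(num_threads)
                 : "memory");
}

// Launched with 160 threads = 5 warps.
__global__ void partial_sync(int *out)
{
    __shared__ int buf[64];
    const unsigned tid  = threadIdx.x;
    const unsigned warp = tid / 32;

    if (warp < 2) {
        // Warps 0 and 1 each fill their own half of the buffer.
        buf[tid] = (int)tid;

        // Only these 64 threads take part in the barrier.
        named_barrier_sync(1, 64);

        // Each thread reads an element written by the other warp.
        out[tid] = buf[63 - tid];
    } else {
        // Warps 2..4 never wait on the barrier and run different code.
        out[tid] = -1;
    }
}

int main()
{
    const int n = 160;
    int h_out[n];
    int *d_out = nullptr;

    cudaMalloc(&d_out, n * sizeof(int));
    partial_sync<<<1, n>>>(d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // Thread 0 (warp 0), thread 63 (warp 1), thread 64 (warp 2).
    printf("%d %d %d\n", h_out[0], h_out[63], h_out[64]);
    return 0;
}
```

The branch on the warp index is uniform within each warp, so every thread of warps 0 and 1 reaches the barrier and the count of 64 is met. What I am looking for is the equivalent of named_barrier_sync on AMD hardware.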
Is there a way to do this with OpenCL on the AMD platform, possibly using inline assembly code as well? If not, is there another GPGPU API/language for the AMD platform that would allow this?