The compute shader execution model allows the number of invocations to (greatly) exceed the number of individual execution units in a warp/wavefront. For example, hardware warp/wavefront sizes tend to be between 16 and 64, while the number of invocations within a work group (GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS) is required in OpenGL to be no less than 1024.
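To put those numbers side by side, here is a small sketch of a compute shader whose work group uses exactly that guaranteed 1024-invocation minimum; the buffer layout and binding point are just placeholders for the example:

```glsl
#version 430 core

// 32 * 32 * 1 = 1024 invocations per work group: exactly the minimum that
// OpenGL guarantees for GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS. With a
// warp/wavefront size of 32 or 64, this single work group would be split
// across 32 or 16 warps/wavefronts respectively.
layout(local_size_x = 32, local_size_y = 32, local_size_z = 1) in;

layout(std430, binding = 0) buffer Data { float values[]; };

void main() {
    // Flatten the 2D global invocation ID into a linear index
    // (assumes the buffer is laid out row by row).
    uint width = gl_NumWorkGroups.x * gl_WorkGroupSize.x;
    uint index = gl_GlobalInvocationID.y * width + gl_GlobalInvocationID.x;
    values[index] *= 2.0;
}
```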
When a work group spans multiple warps/wavefronts, barrier calls and the use of shared variable data work essentially by halting the progress of all warps/wavefronts until each of them has passed that particular point, and then performing whatever memory flushing is needed so that they can access each other's variables (based on memory barrier usage, of course). If all of the invocations in a work group fit into a single warp, then it's possible to avoid such things.
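As a rough illustration of that barrier-plus-shared-memory pattern, here is a sketch (buffer names and bindings are made up for the example) where each invocation writes one element of a shared array and then reads a neighbour's element, which may have been written by an invocation running on a different warp/wavefront:

```glsl
#version 430 core

// Why barrier() matters when a work group spans several warps/wavefronts:
// no invocation may read its neighbour's shared element until the whole
// work group (not just its own warp) has finished writing.
layout(local_size_x = 256) in;

layout(std430, binding = 0) buffer InputBuf  { float inData[];  };
layout(std430, binding = 1) buffer OutputBuf { float outData[]; };

shared float tile[256];   // visible to every invocation in the work group

void main() {
    uint local  = gl_LocalInvocationID.x;
    uint global = gl_GlobalInvocationID.x;

    // Each invocation stages one value in shared memory.
    tile[local] = inData[global];

    // barrier() halts every warp/wavefront in the work group here and
    // synchronizes the shared-variable writes above so that they are
    // visible to all of them.
    barrier();

    // Now it is safe to read a value written by a different invocation,
    // which may have executed on a different warp/wavefront.
    uint neighbour = (local + 1u) % gl_WorkGroupSize.x;
    outData[global] = tile[local] + tile[neighbour];
}
```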
Basically, you have no control over how CS invocations are grouped into warps. You can assume that the implementation is not trying to be slow (that is, it will generally group invocations from the same work group into the same warp), but you cannot assume that all invocations within the same work group will be in the same warp.
Nor should you assume that each warp only executes invocations from the same work group.
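In practice this means a portable work-group reduction keeps a barrier call at every step of its loop, even once the active stride is smaller than any plausible warp size; the CUDA-style "warp-synchronous" shortcut of dropping the last few synchronizations has no backing in the GLSL spec. A sketch, again with made-up buffer names:

```glsl
#version 430 core

// Portable work-group sum reduction: barrier() stays in every iteration,
// because GLSL makes no promises about how invocations map to
// warps/wavefronts.
layout(local_size_x = 256) in;

layout(std430, binding = 0) buffer InputBuf  { float inData[]; };
layout(std430, binding = 1) buffer OutputBuf { float partialSums[]; };

shared float scratch[256];

void main() {
    uint local = gl_LocalInvocationID.x;
    scratch[local] = inData[gl_GlobalInvocationID.x];
    barrier();

    // Halve the number of active invocations each step.
    for (uint stride = gl_WorkGroupSize.x / 2u; stride > 0u; stride /= 2u) {
        if (local < stride) {
            scratch[local] += scratch[local + stride];
        }
        // Even when 'stride' would fit inside a single warp, this barrier()
        // must stay: the remaining invocations are not guaranteed to share
        // a warp/wavefront. The loop itself is uniform control flow, so
        // placing barrier() here is legal.
        barrier();
    }

    if (local == 0u) {
        partialSums[gl_WorkGroupID.x] = scratch[0];
    }
}
```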