Synchronizations in GPUs

Question

I have a question about how GPUs perform synchronization. As I understand it, when a warp encounters a barrier (assume OpenCL) and the other warps of the same work-group haven't reached it yet, it has to wait. But what exactly does that warp do while it waits? Is it still an active warp? Or does it execute some kind of null operations?

I have noticed that when a kernel contains a synchronization, the number of instructions increases. I wonder what the source of this increase is. Is the synchronization broken down into that many smaller GPU instructions? Or do the idle warps execute some extra instructions?

Finally, I wonder whether the cost added by a synchronization (say barrier(CLK_LOCAL_MEM_FENCE)), compared to the same kernel without it, is affected by the number of warps in a work-group (or threadblock). Thanks.

Tom · Accepted Answer · 2011-07-13T12:04:43.933

An active warp is one that is resident on the SM, i.e. all its resources (registers etc.) have been allocated and the warp is available for execution provided it is schedulable. If a warp reaches a barrier before the other warps in the same threadblock/work-group, it is still active (it is still resident on the SM and all its registers are still valid), but it won't execute any instructions since it is not ready to be scheduled.

Inserting a barrier not only stalls execution but also acts as a barrier for the compiler: the compiler is not allowed to perform most optimisations across the barrier since this may invalidate the purpose of the barrier. This is the most likely reason you are seeing more instructions - without the barrier the compiler is able to perform more optimisations.

The cost of a barrier is very dependent on what your code is doing, but each barrier introduces a bubble where all warps have to (effectively) become idle before they all start work again, so if you have a very large threadblock/work-group then of course there is potentially a bigger bubble than with a small block. The impact of the bubble depends on your code - if your code is very memory bound then the barrier will expose the memory latencies where before they may have been hidden, but if more balanced then it may have a less noticeable effect.
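To make the "bubble" concrete, here is a toy model in Python (not GPU code, and not a measurement). It assumes each warp reaches the barrier at some time and that no warp may continue until the slowest one has arrived. The function name and the arrival times are made up for illustration, and the model ignores warps from other blocks that the SM could schedule in the meantime.

```python
def barrier_idle_time(arrival_times):
    """Return (release_time, total_idle) for warps arriving at one barrier.

    arrival_times: time at which each warp of a work-group reaches the
    barrier (arbitrary units, hypothetical values). No warp may proceed
    before the slowest one arrives, so each warp idles for
    (latest arrival - its own arrival).
    """
    release = max(arrival_times)
    total_idle = sum(release - t for t in arrival_times)
    return release, total_idle


# Hypothetical arrival times: a small group of 2 warps vs. a larger one of 8.
small_group = [10, 12]
large_group = [10, 12, 9, 11, 10, 20, 12, 11]

print(barrier_idle_time(small_group))  # (12, 2)
print(barrier_idle_time(large_group))  # (20, 65)
```

With more warps in the group there are more chances that one of them is a straggler, and every other warp pays for it, which is why a larger group can potentially mean a bigger bubble.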

This means that in a very memory-bound kernel you may be better off launching a larger number of smaller blocks so that other blocks can be executing when one block is bubbling on a barrier. You would need to ensure that your occupancy increases as a result, and if you are sharing data between threads using the block-shared-memory then there is a trade-off to be had.
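As a rough way to check whether smaller blocks actually raise occupancy, the arithmetic can be sketched as below. This is a simplified estimate, not a vendor formula: the function name is hypothetical, the per-SM limits are illustrative defaults rather than the values for any particular device (look up the real ones for your hardware), and register usage is ignored entirely.

```python
def resident_warps(threads_per_block, shared_bytes_per_block,
                   warp_size=32, max_warps_per_sm=48, max_blocks_per_sm=8,
                   shared_bytes_per_sm=48 * 1024):
    """Estimate (blocks per SM, resident warps per SM) for one block shape.

    All limits are assumed, illustrative values; registers are not modelled.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    # How many blocks fit under the warp limit.
    by_warps = max_warps_per_sm // warps_per_block
    # How many blocks fit under the shared-memory limit.
    if shared_bytes_per_block:
        by_shared = shared_bytes_per_sm // shared_bytes_per_block
    else:
        by_shared = max_blocks_per_sm
    blocks = min(by_warps, by_shared, max_blocks_per_sm)
    return blocks, blocks * warps_per_block


print(resident_warps(1024, 16 * 1024))  # (1, 32)
print(resident_warps(256, 4 * 1024))    # (6, 48)
print(resident_warps(64, 1 * 1024))     # (8, 16)
```

Under these assumed limits, going from 1024 to 256 threads per block gives the SM six independent blocks to switch between instead of one. Going all the way down to 64 threads hits the blocks-per-SM cap and lowers the number of resident warps, so smaller is not automatically better.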

Thanks for the detailed answer. It would be nice if you could share the documents this knowledge comes from; I would like to cite them in my research. Could you explain further why a barrier exposes memory latencies in memory-bound kernels? As I understand it now, a memory request issued shortly before a sync, which would otherwise be hidden by some computation, stalls the warp until the data arrives. Is that correct? On the other hand, if the kernel is not memory-bound, what does a sync expose? Instruction pipeline latency? (Assuming no divergence; and how does all of this interact with divergence?) — Zk1001, Jul 13 '11 at 15:03
