I have a few questions about how GPUs perform synchronization. As far as I know, when a warp reaches a barrier (say, in OpenCL) and the other warps in the same work-group haven't arrived yet, it has to wait. But what exactly does that warp do while it waits? Is it still an active warp? Or does it execute some kind of null operations?
I've noticed that when a kernel contains a synchronization, the number of executed instructions increases. I wonder where this increase comes from. Is the synchronization broken down into that many smaller GPU instructions? Or do the idle warps execute extra instructions?
Finally, I'd really like to know whether the cost added by a synchronization (say, barrier(CLK_LOCAL_MEM_FENCE)), compared to the same kernel without it, is affected by the number of warps in a work-group (or thread block). Thanks
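
To make the comparison concrete, here is a minimal sketch of the kind of kernel pair I have in mind. The kernel names, the TILE macro, and the access pattern are all made up for illustration, and it assumes a 1D launch whose local size equals TILE.

```c
// Hypothetical example: kernel names, TILE, and the access pattern are
// illustrative assumptions, not taken from a real workload.
#ifndef TILE
#define TILE 256  // assumed work-group size; must equal the local size at launch
#endif

// Variant A: each work-item reads a neighbour's slot in local memory,
// so a barrier is required between the write and the read.
__kernel void with_barrier(__global const float *in, __global float *out)
{
    __local float tile[TILE];
    const size_t lid = get_local_id(0);
    const size_t gid = get_global_id(0);

    tile[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);        // every work-item in the group waits here
    out[gid] = tile[(lid + 1) % TILE];   // reads a value written by another work-item
}

// Variant B: each work-item reads back only its own slot,
// so no barrier is needed.
__kernel void without_barrier(__global const float *in, __global float *out)
{
    __local float tile[TILE];
    const size_t lid = get_local_id(0);
    const size_t gid = get_global_id(0);

    tile[lid] = in[gid];
    out[gid] = tile[lid];
}
```

Building this with different values (for example -DTILE=64, -DTILE=128, -DTILE=256) and a matching local size would change the number of warps per work-group, which is the effect I am asking about. One caveat: a compiler may optimize away the local-memory round trip in the second kernel, so the two variants are not guaranteed to differ only by the barrier.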