The reduction method suggested by NVIDIA uses __syncthreads()
inside conditional branches, e.g.:
if (blockSize >= 512) {
    if (tid < 256) { sdata[tid] += sdata[tid + 256]; }
    __syncthreads();
}
or
for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1)
{
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}
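For context, in NVIDIA's "Optimizing Parallel Reduction in CUDA" slides the first snippet is one line of a cascade inside a templated kernel, roughly like this (a sketch reconstructed from the slides, so exact names and signatures may differ; blockSize is a compile-time template parameter):

// Sketch reconstructed from the reduction slides; names may differ.
template <unsigned int blockSize>
__global__ void reduce(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    // ... load input into shared memory ...
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }
    // ... warp-level finish for tid < 32, final write to g_odata ...
}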
In the second example, __syncthreads()
is inside the body of a for
loop, which is also a form of conditional branching.
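Spelled out with the branch made explicit, that loop is equivalent to (a hand-desugared sketch of the same code):

unsigned int s = blockDim.x / 2;
while (s > 32) {                  // the loop test is itself a branch
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();              // only reached while the branch is taken
    s >>= 1;
}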
However, a number of questions on SO raise the problem of __syncthreads()
inside conditional branches (e.g. "Can I use __syncthreads() after having dropped threads?" and "conditional syncthreads & deadlock (or not)"), and the answers say that __syncthreads()
in conditional branches may lead to a deadlock. Consequently, the reduction method suggested by NVIDIA may deadlock, if we take the documentation on which those answers are based at face value.
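The pattern those answers warn about is, as I understand it, one where the condition diverges within a block, e.g. a hypothetical fragment like:

// Hypothetical fragment: the condition depends on tid, so only part of
// the block reaches the barrier.
if (tid < 32) {
    sdata[tid] += sdata[tid + 32];
    __syncthreads();   // threads with tid >= 32 never get here
}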
Furthermore, if __syncthreads()
can't be used inside conditional branches, then I'm afraid that many basic operations are ruled out, and reduction is just one example.
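For instance, the textbook tiled matrix multiplication keeps its barriers inside a loop body in exactly the same way (a standard-pattern sketch, not taken from the NVIDIA reduction material; assumes n is a multiple of TILE):

#define TILE 16

__global__ void matmul(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int m = 0; m < n / TILE; ++m) {   // loop body is a conditional branch
        As[threadIdx.y][threadIdx.x] = A[row * n + m * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(m * TILE + threadIdx.y) * n + col];
        __syncthreads();                   // barrier inside the loop
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                   // and again before the next tile
    }
    C[row * n + col] = acc;
}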
So how can one do reduction in CUDA without using __syncthreads()
in conditional branches? Or is this a bug in the documentation?