
The reduction method suggested by NVIDIA uses __syncthreads() inside conditional branches, for example:

if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }

or

for (unsigned int s=blockDim.x/2; s>32; s>>=1)
{
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}

In the second example, __syncthreads() is inside the for loop body, which is also a conditional branch.

However, a number of questions on SO raise the problem of __syncthreads() inside conditional branches (e.g. Can I use __syncthreads() after having dropped threads? and conditional syncthreads & deadlock (or not)), and the answers say that __syncthreads() in conditional branches may lead to a deadlock. Consequently, the reduction method suggested by NVIDIA may deadlock (if one believes the documentation on which those answers are based).

Furthermore, if __syncthreads() can't be used inside conditional branches, then I'm afraid that many basic operations become impossible, and reduction is just one example.

So how can one do reduction in CUDA without using __syncthreads() in conditional branches? Or is this a bug in the documentation?

  • Note that in the first code snippet, `blockSize` is a compile time constant or template parameter, so there is no branch emitted by the compiler. Note that in the second, the loop trip count is constant for every thread in the block, so again, there is no conditional branching around a `__syncthreads` call. – talonmies Sep 13 '16 at 06:55

1 Answer


The limitation is not

__syncthreads cannot be used in conditional branches

The limitation is

__syncthreads cannot be used in branches which will not be traversed by all threads at the same time

Notice that in both of the examples you give, __syncthreads is not covered by a condition that depends on the thread ID (or some per-thread data). In the first case, blockSize is a template parameter which does not depend on the thread ID. In the second case, __syncthreads likewise comes after the thread-dependent if has been closed.

Yes, the for loop's s > 32 is a condition, but it is a condition whose truth value does not depend on the thread or its data in any way. blockDim.x is the same for all threads, and all threads perform exactly the same modifications of s. This means that all threads reach the __syncthreads at exactly the same point of their control flow, which is perfectly OK.
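
To make that concrete, here is a minimal sketch of that loop embedded in a full kernel (the name reduce_shared and the launch configuration are illustrative, not taken from the NVIDIA sample; it assumes blockDim.x is a power of two and the input length is a multiple of blockDim.x). Every thread runs the same number of loop iterations, so every thread reaches each __syncthreads() together:

// Illustrative sketch (not the NVIDIA sample itself): the loop bound depends
// only on blockDim.x, which is identical for every thread in the block, so
// all threads reach each __syncthreads() at the same point of control flow.
__global__ void reduce_shared(const float *in, float *out)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;

    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Uniform loop: s evolves identically in every thread.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)                      // thread-dependent branch...
            sdata[tid] += sdata[tid + s]; // ...closed again here,
        __syncthreads();                  // ...so the barrier is reached by all threads
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// Launched e.g. as: reduce_shared<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_out);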

The other case, where you cannot use __syncthreads, is a condition which can be true for some threads and false for others. In such a case, you have to close all such conditions before using __syncthreads. So instead of this:

if (threadIdx.x < SOME_CONSTANT)
{
  operation1();
  __syncthreads();
  operation2();
}

You must do this:

if (threadIdx.x < SOME_CONSTANT)
{
  operation1();
}
__syncthreads();
if (threadIdx.x < SOME_CONSTANT)
{
  operation2();
}

Both of the examples you gave demonstrate this too: the thread-ID-dependent condition is closed before __syncthreads is called.
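
For completeness, here is a hedged sketch of how the first snippet is typically used (the name reduce_tpl and the details are illustrative; the actual NVIDIA kernel differs, e.g. it skips __syncthreads() in the final warp). Because blockSize is a template parameter, if (blockSize >= 512) is evaluated at compile time, and the only thread-dependent conditions (tid < N) are closed before each barrier:

// Illustrative sketch only: blockSize is a compile-time constant, so the
// "if (blockSize >= ...)" tests are resolved by the compiler; the surviving
// __syncthreads() calls are executed unconditionally by every thread.
// (Assumes the block is launched with exactly blockSize threads, a power of two.)
template <unsigned int blockSize>
__global__ void reduce_tpl(const float *in, float *out)
{
    __shared__ float sdata[blockSize];
    unsigned int tid = threadIdx.x;

    sdata[tid] = in[blockIdx.x * blockSize + tid];
    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }
    if (blockSize >=  64) { if (tid <  32) sdata[tid] += sdata[tid +  32]; __syncthreads(); }
    if (blockSize >=  32) { if (tid <  16) sdata[tid] += sdata[tid +  16]; __syncthreads(); }
    if (blockSize >=  16) { if (tid <   8) sdata[tid] += sdata[tid +   8]; __syncthreads(); }
    if (blockSize >=   8) { if (tid <   4) sdata[tid] += sdata[tid +   4]; __syncthreads(); }
    if (blockSize >=   4) { if (tid <   2) sdata[tid] += sdata[tid +   2]; __syncthreads(); }
    if (blockSize >=   2) { if (tid <   1) sdata[tid] += sdata[tid +   1]; __syncthreads(); }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// Instantiated and launched e.g. as: reduce_tpl<256><<<numBlocks, 256>>>(d_in, d_out);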

  • What about this example? `if ((idx < ds) && (idy < ds)){ ... __syncthreads();}` from https://github.com/olcf/cuda-training-series/blob/9cd77d539d85ee3457e09a083ba299b787b7a16e/exercises/hw2/matrix_mul_shared_solution.cu#L35 – Eduardo Reis May 16 '22 at 17:36
  • @EduardoReis 8192 (= `ds`) is divisible by 32 (= `blockDim`), so that conditional will never fail. I don't know if the code would still work if that wasn't the case. – Angew is no longer proud of SO May 17 '22 at 06:59
  • I know that in my kernel code I have something similar, `if (idx >= ds) return;`, and in the cases where the thread does not return, it executes a `__syncthreads();`; and `ds` is not a multiple of `32`. Apparently it works and I don't have deadlocks. – Eduardo Reis May 17 '22 at 16:52
  • Just noticed that this matter is further discussed [here](https://forums.developer.nvidia.com/t/a-stupid-question-on-syncthread-function/64454/2). – Eduardo Reis May 17 '22 at 17:42
  • @EduardoReis To summarise the external discussion: If `__syncthreads();` is skipped by some threads, it *may* hang. The code is incorrect, but not guaranteed to fail. – Angew is no longer proud of SO May 18 '22 at 06:54
  • Does that make the code provided in all tiled-matrix multiplication examples out there, such as the [one provided by NVIDIA folks](https://github.com/olcf/cuda-training-series/blob/9cd77d539d85ee3457e09a083ba299b787b7a16e/exercises/hw2/matrix_mul_shared_solution.cu#L35), incorrect? Asserting that feels wrong. – Eduardo Reis May 19 '22 at 03:49
  • For matrices whose size is not a multiple of `32`, some threads within the same warp would skip `__syncthreads()`, according to my understanding of the theory presented here. – Eduardo Reis May 19 '22 at 03:52
  • @EduardoReis I've reached the limit of my understanding as it was built up from reading CUDA documentation and other materials (such as the discussion you linked to). If you need further clarification, you might post a dedicated question, or perhaps reach out to the author of the OLCF code you're linking to. – Angew is no longer proud of SO May 19 '22 at 06:47
  • Thank you so much for taking the time to look at this and for your recommendation. I will do that. For now, I am writing my code so as to avoid having threads skip the `__syncthreads()`, as shown in [here](https://gist.github.com/dniku/f168582fe98ba97346e4). – Eduardo Reis May 19 '22 at 14:24