3

The CUDA Programming Guide states that:

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.

So if I need to synchronize threads inside a conditional branch that only some threads of the block take, does this mean that a __syncthreads() call in that branch won't work?

I'm imagining that there might be all sorts of cases in which you might need to do this; for example, if you have a binary mask and need to apply a certain operation on pixels conditionally. Say, if (mask(x, y) != 0) then execute the code that includes __syncthreads(), otherwise do nothing. How would that be done?

talonmies
Kristian D'Amato

2 Answers

8

If you need to go this route you could split the body into two phases:

if (condition)
{
    // code before sync
}
__syncthreads();
if (condition) // or remember a flag or whatever
{
    // code after sync
}
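As a concrete illustration of the two-phase split, here is a minimal sketch for a 1D masked operation. The kernel name maskedNeighbourSum, the helper runMaskedNeighbourSum, the block size kBlock, and the arithmetic itself are all made up for this example; the only point is that the barrier sits outside the condition, so every thread of the block reaches it.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlock = 128;  // assumed block size; must match the launch below

// Hypothetical kernel: masked elements are doubled and staged in shared
// memory, then each masked element adds its right neighbour's staged value.
__global__ void maskedNeighbourSum(const float* in, const unsigned char* mask,
                                   float* out, int n)
{
    __shared__ float tile[kBlock];
    const int t = threadIdx.x;
    const int i = blockIdx.x * blockDim.x + t;
    const bool active = (i < n) && (mask[i] != 0);

    // Phase 1 ("code before sync"): masked threads stage a doubled value,
    // all other threads stage 0 so the tile is fully initialised.
    tile[t] = active ? 2.0f * in[i] : 0.0f;

    // Barrier outside any branch: every thread of the block reaches it.
    __syncthreads();

    // Phase 2 ("code after sync"): only masked threads read and write.
    if (active) {
        out[i] = tile[t] + tile[(t + 1) % kBlock];
    }
}

// Hypothetical host helper. Error checking omitted for brevity; assumes n > 0.
std::vector<float> runMaskedNeighbourSum(const std::vector<float>& in,
                                         const std::vector<unsigned char>& mask)
{
    const int n = static_cast<int>(in.size());
    std::vector<float> out(n, -1.0f);  // -1 marks elements the kernel skips
    float *dIn = nullptr, *dOut = nullptr;
    unsigned char* dMask = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMalloc(&dMask, n * sizeof(unsigned char));
    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dOut, out.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dMask, mask.data(), n * sizeof(unsigned char), cudaMemcpyHostToDevice);
    maskedNeighbourSum<<<(n + kBlock - 1) / kBlock, kBlock>>>(dIn, dMask, dOut, n);
    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dMask);
    return out;
}
```

Threads whose mask entry is 0 still execute the barrier but never touch out, which matches the "otherwise do nothing" requirement in the question.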

Alternatively you could use the condition to set a flag that disables certain operations, for example if you're computing a delta update you could do the following:

// *ALL* threads compute a delta update; threads that would have failed the
// condition simply compute garbage that is never used.
// This part can include __syncthreads().
if (condition)
{
    // apply update
}
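A minimal sketch of this flag-based variant, under assumptions of my own: the kernel name maskedDeltaUpdate, the helper runMaskedDeltaUpdate, the block size kBlock, and the "difference to the left neighbour" delta are hypothetical and only there to make the pattern concrete. Every thread does the staging, the barrier, and the delta computation; the condition only guards the final write.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlock = 128;  // assumed block size; must match the launch below

// Hypothetical kernel: every thread computes a delta (difference to its left
// neighbour within the block), but only masked threads apply it.
__global__ void maskedDeltaUpdate(float* data, const unsigned char* mask, int n)
{
    __shared__ float tile[kBlock];
    const int t = threadIdx.x;
    const int i = blockIdx.x * blockDim.x + t;
    const bool inRange = i < n;
    const bool apply = inRange && (mask[i] != 0);  // the "flag"

    // ALL threads stage a value; out-of-range threads stage 0.
    tile[t] = inRange ? data[i] : 0.0f;
    __syncthreads();  // unconditional, so the whole block reaches it

    // ALL threads compute a delta; for unmasked threads this is wasted work.
    const float left = tile[t > 0 ? t - 1 : 0];
    const float delta = tile[t] - left;

    // Only flagged threads apply the update.
    if (apply) {
        data[i] += delta;
    }
}

// Hypothetical host helper. Error checking omitted for brevity; assumes n > 0.
std::vector<float> runMaskedDeltaUpdate(std::vector<float> data,
                                        const std::vector<unsigned char>& mask)
{
    const int n = static_cast<int>(data.size());
    float* dData = nullptr;
    unsigned char* dMask = nullptr;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMalloc(&dMask, n * sizeof(unsigned char));
    cudaMemcpy(dData, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dMask, mask.data(), n * sizeof(unsigned char), cudaMemcpyHostToDevice);
    maskedDeltaUpdate<<<(n + kBlock - 1) / kBlock, kBlock>>>(dData, dMask, n);
    cudaMemcpy(data.data(), dData, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dMask);
    return data;
}
```

The trade-off is some redundant arithmetic in exchange for a kernel in which no barrier ever sits inside a divergent branch.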
Tom
  • This answer is a bit dated, but it shows up high on the search results. I found another answer has more relevance in today's CUDA functionality: http://stackoverflow.com/questions/15146886/conditional-syncthreads-deadlock-or-not . Maybe this answer needs an update? – Liang Apr 28 '16 at 00:00
  • @Liang: Theoretically, this answer is still correct; the CUDA model states that all threads within a block must reach the barrier. The post you referenced describes why early exit works, but it does not work on all GPUs (G80) and there's no guarantee it will always work. – Tom Apr 28 '16 at 17:46
1

From 3.0 you can use the warp vote functions to accomplish what __syncthreads can't:

Warp vote functions are only supported by devices of compute capability 1.2 and higher.

int __all(int predicate); evaluates predicate for all threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for all of them.

int __any(int predicate); evaluates predicate for all threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for any of them.

unsigned int __ballot(int predicate); evaluates predicate for all threads of the warp and returns an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp. This function is only supported by devices of compute capability 2.x.
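For illustration, here is a small sketch of a warp vote in use. Note two assumptions: it uses __ballot_sync, the variant that replaced __ballot from CUDA 9 onwards (the legacy forms quoted above are deprecated there), and the kernel name countMaskedPerWarp, the helper runCountMaskedPerWarp, and the block size are made up for this example. A vote is only warp-wide, so by itself it does not make a __syncthreads() inside a divergent branch legal; it just lets the lanes of a warp agree on a predicate.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kVoteBlock = 64;  // assumed block size; a multiple of the warp size

// Hypothetical kernel: each warp counts how many of its lanes have a
// non-zero mask entry and lane 0 stores the result.
__global__ void countMaskedPerWarp(const unsigned char* mask, int* warpCounts, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int predicate = (i < n) && (mask[i] != 0);

    // Called outside any branch, so all 32 lanes named in the mask take part.
    const unsigned int ballot = __ballot_sync(0xffffffffu, predicate);

    if (threadIdx.x % warpSize == 0) {
        warpCounts[i / warpSize] = __popc(ballot);  // number of set bits
    }
}

// Hypothetical host helper. Error checking omitted for brevity; assumes n > 0.
std::vector<int> runCountMaskedPerWarp(const std::vector<unsigned char>& mask)
{
    const int n = static_cast<int>(mask.size());
    const int blocks = (n + kVoteBlock - 1) / kVoteBlock;
    const int warps = blocks * (kVoteBlock / 32);
    std::vector<int> counts(warps, 0);
    unsigned char* dMask = nullptr;
    int* dCounts = nullptr;
    cudaMalloc(&dMask, n * sizeof(unsigned char));
    cudaMalloc(&dCounts, warps * sizeof(int));
    cudaMemcpy(dMask, mask.data(), n * sizeof(unsigned char), cudaMemcpyHostToDevice);
    countMaskedPerWarp<<<blocks, kVoteBlock>>>(dMask, dCounts, n);
    cudaMemcpy(counts.data(), dCounts, warps * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dMask);
    cudaFree(dCounts);
    return counts;
}
```

If the decision has to be uniform across the whole block rather than a single warp, the block-wide counterparts __syncthreads_count, __syncthreads_and and __syncthreads_or combine a barrier with the same kind of predicate reduction.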

Otherwise, there are also the atomic bitwise functions:

atomicAnd, atomicOr, atomicXor

See section B.11 of the CUDA Programming Guide.

fabrizioM