
I've hit a strange performance inversion in a filter kernel with and without branching. The kernel with branching runs ~1.5x faster than the kernel without branching.

Basically, I need to sort a bunch of radiance rays and then apply interaction kernels to them. Since there is a lot of accompanying data, I can't use something like thrust::sort_by_key() many times.

Idea of the algorithm:

  1. Run a loop over all possible interaction types (there are five)
  2. On every iteration, each warp thread votes for its interaction type
  3. After the loop completes, every warp thread knows about the other threads with the same interaction type
  4. Threads elect their leader (per interaction type)
  5. The leader updates the interaction offsets table using atomicAdd
  6. Each thread writes its data to the corresponding offset
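The steps above can be sketched as a host-side C++ simulation of one warp, with the warp intrinsics modeled by GCC/Clang builtins. All names here are illustrative assumptions, not the author's actual code:

```cpp
#include <cstdint>

// Host-side simulation of one warp (32 lanes) performing the six steps.
constexpr int kWarpSize = 32;
constexpr int kInteractionCount = 5;

// Model of __ballot(pred): bit `lane` is set iff that lane's predicate holds.
uint32_t ballot(const bool pred[kWarpSize]) {
    uint32_t mask = 0;
    for (int lane = 0; lane < kWarpSize; ++lane)
        if (pred[lane]) mask |= 1u << lane;
    return mask;
}

// One warp pass: every lane ends up with a unique slot in the
// per-interaction output region.
void warp_filter(const int decision[kWarpSize],
                 int offsets[kInteractionCount],  // global offsets table
                 int slot[kWarpSize]) {           // out: each lane's write index
    for (int i = 0; i < kInteractionCount; ++i) { // step 1
        bool pred[kWarpSize];
        for (int lane = 0; lane < kWarpSize; ++lane)
            pred[lane] = (decision[lane] == i);   // step 2: vote
        uint32_t active = ballot(pred);           // step 3
        if (!active) continue;
        int leader = __builtin_ctz(active);       // step 4: __ffs(active) - 1
        (void)leader;                             // (only the leader adds below)
        int count = __builtin_popcount(active);   // warp_progress
        int base = offsets[i];                    // step 5: leader's atomicAdd
        offsets[i] += count;
        // step 6: lane's slot = base + number of earlier lanes in the mask
        for (int lane = 0; lane < kWarpSize; ++lane)
            if (pred[lane])
                slot[lane] = base + __builtin_popcount(active & ((1u << lane) - 1));
    }
}
```

Running this for a warp where lane `n` picks decision `n % 5` gives each interaction type a contiguous, gap-free output range, which is the whole point of the aggregation.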

I used the techniques described in this Nvidia post: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/

My first kernel contains a branch inside the loop and runs in ~5 ms:

int active;
int leader;
int warp_progress;
for (int i = 0; i != hit_interaction_count; ++i)
{
  if (i == decision)
  {
    active = __ballot(1);            // mask of lanes whose decision == i
    leader = __ffs(active) - 1;      // lowest lane in the mask leads
    warp_progress = __popc(active);  // number of lanes in the mask
  }
}
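With the legacy (pre-CUDA 9) `__ballot`, lanes that are inactive at the call site contribute 0 bits, so `__ballot(1)` inside the divergent `if` returns exactly the mask of lanes whose `decision` equals the current `i`. A host-side model of that behaviour (names are illustrative):

```cpp
#include <cstdint>

// Model: only lanes that took the branch participate in the ballot;
// inactive lanes read as 0. So the result is the mask of lanes with
// decision == i.
uint32_t ballot_in_branch(const int decision[32], int i) {
    uint32_t mask = 0;
    for (int lane = 0; lane < 32; ++lane)
        if (decision[lane] == i)   // lane is active inside the `if`
            mask |= 1u << lane;    // __ballot(1) contributes this lane's bit
    return mask;
}
```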

My second kernel uses a two-element lookup table, has no branching, and runs in ~8 ms:

int active = 0;
for (int i = 0; i != hit_interaction_count; ++i)
{
  const int masks[2] = { 0, ~0 };
  int mask = masks[i == decision];    // ~0 iff this is my interaction type
  active |= (mask & __ballot(mask));  // keep only my type's ballot result
}
int leader = __ffs(active) - 1;
int warp_progress = __popc(active);
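The leader election and warp progress come straight from the mask: `__ffs(active) - 1` is the lowest set lane, `__popc(active)` the number of participating lanes. On the host the same values can be checked with the corresponding GCC/Clang builtins (these helpers are just a check, not the author's code):

```cpp
#include <cstdint>

// Host mirrors of the CUDA intrinsics used above.
int leader_of(uint32_t active)   { return __builtin_ffs(active) - 1; }  // __ffs(active) - 1
int progress_of(uint32_t active) { return __builtin_popcount(active); } // __popc(active)
```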

Common part:

int warp_offset;
if (lane_id() == leader)
  warp_offset = atomicAdd(&interactions_offsets[decision], warp_progress);
warp_offset = warp_broadcast(warp_offset, leader);
...copy data here...
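`warp_broadcast` is not shown; presumably it wraps the shuffle intrinsic (`__shfl(value, leader)` before CUDA 9), which makes every lane read the leader's register. A host-side model of that behaviour (an assumption about the helper, not the author's code):

```cpp
// Model of warp_broadcast(value, leader): every lane receives the value
// held by lane `leader`, which is what a __shfl-based wrapper returns.
void warp_broadcast_model(const int value[32], int leader, int out[32]) {
    for (int lane = 0; lane < 32; ++lane)
        out[lane] = value[leader];
}
```

This is why only the leader needs to perform the atomicAdd: the returned `warp_offset` is then fanned out to the whole warp in one shuffle.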

How can that be? Is there any way to implement such a filter kernel so that it runs faster than the branching one?

UPD: The complete source code can be found in filter_kernel in cuda_equation/radiance_cuda.cu at https://bitbucket.org/radiosity/engine/src

  • In both versions you have a `for` loop which runs `hit_interaction_count` times. But in your "branch version" you really execute those three lines only a single time, while in your second kernel you execute all three lines on every iteration. Why should the second version be faster? Of course a simple `if` check is faster than the three lines in your second version. And additionally, I hope that your `for`-`if` construct gets optimized away to a simple `if` block that checks whether `decision < hit_interaction_count` holds or not. – BlameTheBits Sep 20 '17 at 22:59
  • This is a warp of threads, so it does not matter whether the other threads execute the code inside the `if` or not; they are stalled either way. The `if` block can't be optimized away because every thread in a warp has its own `decision` value. – Stepan Tezyunichev Sep 20 '17 at 23:26
  • I believe a question like this ought to be accompanied by a [mcve] for both cases. As it stands, it's not obvious what your actual comparison kernels are (since the "common part" is unclear and doesn't show some code), and nobody could run it to see the difference or even compile the two cases to compare machine code. – Robert Crovella Sep 21 '17 at 01:22

1 Answer


I think this is CPU-programmer brain deformation. On a CPU I would expect a performance boost because the branch, and with it the branch misprediction penalty, is eliminated.

But there is no branch prediction on a GPU, and hence no misprediction penalty, so only the instruction count matters.

First, let me reduce the code to its simplest form.

With branch:

int active;
for (int i = 0; i != hit_interaction_count; ++i)
    if (i == decision)
        active = __ballot(1);

Without branch:

int active = 0;
for (int i = 0; i != hit_interaction_count; ++i)
{
  int mask = 0 - (i == decision);
  active |= (mask & __ballot(mask));
}
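The `0 - (i == decision)` idiom relies on the comparison producing 0 or 1, so the subtraction yields an all-zeros or all-ones mask without any branch. A small host-side check:

```cpp
// The branch-free mask idiom from the loop above: (i == decision) is
// 0 or 1, so 0 - (i == decision) is all-zeros or all-ones.
int select_mask(int i, int decision) { return 0 - (i == decision); }
```

ANDing the ballot result with this mask then keeps or discards it, replacing the `if`.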

In the first version there are ~3 operations: the compare, the branch, and `__ballot()`. In the second version there are ~5 operations: the compare, building the mask, `__ballot()`, `&`, and `|=`. And there are ~15 ops in the common code.

Both loops run for 5 iterations. In total that is ~35 ops in the first and ~45 ops in the second. This calculation can explain the performance degradation.

  • I still don't understand why you don't reduce your "branch version" to a simple `if` statement like `if (decision < hit_interaction_count) active = ...`. Maybe add another condition to check that `decision >= 0`. Or just initialize the variable directly with `__ballot(1)`, as the lack of default initialization of the `active` variable seems to indicate that it will always be initialized later in that `for` loop. – BlameTheBits Sep 22 '17 at 22:08
  • Every ray (thread in a warp) has its own random decision: absorb, reflect, diffuse, or transmit. And a thread does not know about the other threads' decisions. I'm using `__ballot()` to count the decisions of each type and perform `atomicAdd()` once per warp. For a single thread with a particular decision, only one `__ballot()` matters. But there are other threads with (maybe) different decisions. That is why I need to call `__ballot()` several times. – Stepan Tezyunichev Sep 24 '17 at 07:17