I've got a strange performance inversion on a filter kernel with and without branching: the kernel with branching runs ~1.5x faster than the branch-free one.
Basically I need to sort a bunch of radiance rays and then apply interaction kernels. Since there is a lot of accompanying data, I can't afford to call something like thrust::sort_by_key() many times.
Idea of the algorithm:
- Run a loop over all possible interaction types (there are five)
- In every iteration each warp thread votes for its interaction type
- After the loop completes, every warp thread knows which other threads share its interaction type
- Threads elect their leader (per interaction type)
- The leader updates the interaction offsets table using atomicAdd
- Each thread writes its data to the corresponding offset
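The steps above can be sketched roughly as follows. This is an assumption of how the pieces fit together, not the actual code: it uses the pre-CUDA 9 intrinsics from the question (`__ballot`), reuses the question's names (`decision`, `hit_interaction_count`, `interactions_offsets`, `lane_id`, `warp_broadcast`), and adds a `rank` term so that each voting thread ends up with a unique output slot:

```cuda
// Hypothetical sketch of the per-warp aggregation loop described above.
__device__ int compute_slot(int decision, int hit_interaction_count,
                            int *interactions_offsets)
{
    int slot = -1;
    for (int i = 0; i != hit_interaction_count; ++i)
    {
        int active = __ballot(i == decision);   // warp votes for type i
        if (i == decision)
        {
            int leader = __ffs(active) - 1;     // lowest voting lane leads
            // this thread's position among the voting lanes below it
            int rank = __popc(active & ((1u << lane_id()) - 1));
            int warp_offset = 0;
            if (lane_id() == leader)            // one atomic per warp per type
                warp_offset = atomicAdd(&interactions_offsets[i],
                                        __popc(active));
            warp_offset = warp_broadcast(warp_offset, leader);
            slot = warp_offset + rank;          // unique output index
        }
    }
    return slot;
}
```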
I used the techniques described in this NVIDIA post: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/
My first kernel contains a branch inside the loop and runs in ~5 ms:
int active;
int leader;
int warp_progress;
for (int i = 0; i != hit_interaction_count; ++i)
{
    if (i == decision)
    {
        active = __ballot(1);
        leader = __ffs(active) - 1;
        warp_progress = __popc(active);
    }
}
My second kernel uses a two-element lookup table instead of a branch and runs in ~8 ms:
int active = 0;
for (int i = 0; i != hit_interaction_count; ++i)
{
    const int masks[2] = { 0, ~0 };
    int mask = masks[i == decision];
    active |= (mask & __ballot(mask));
}
int leader = __ffs(active) - 1;
int warp_progress = __popc(active);
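One guess worth testing (an assumption, not a measured fix): the indexed load from the local `masks` array may be what hurts, since a dynamically indexed local array can be spilled to local memory, while the same 0/~0 mask can be produced by negating the predicate in a single instruction. A branch-free variant without the table might look like:

```cuda
// Same logic as the lookup-table kernel, but the mask is computed as
// -(i == decision), which is 0x00000000 or 0xFFFFFFFF -- identical to
// masks[i == decision] without touching a local array.
int active = 0;
for (int i = 0; i != hit_interaction_count; ++i)
{
    int is_mine = (i == decision);          // 0 or 1
    active |= __ballot(is_mine) & -is_mine; // keep only my iteration's ballot
}
int leader = __ffs(active) - 1;
int warp_progress = __popc(active);
```

Whether this actually beats the branching kernel would need profiling; the branchy version may simply win because inactive iterations do no work at all.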
Common part:
int warp_offset;
if (lane_id() == leader)
    warp_offset = atomicAdd(&interactions_offsets[decision], warp_progress);
warp_offset = warp_broadcast(warp_offset, leader);
...copy data here...
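The helpers `lane_id()` and `warp_broadcast()` aren't shown in the question; a plausible definition (an assumption) for pre-CUDA 9 hardware, consistent with the `__ballot` usage above, would be:

```cuda
// Hypothetical helper implementations using pre-CUDA 9 shuffle intrinsics.
__device__ __forceinline__ int lane_id()
{
    return threadIdx.x & 31;        // lane index within the warp
}

__device__ __forceinline__ int warp_broadcast(int value, int leader)
{
    return __shfl(value, leader);   // every lane reads the leader's value
}
```

On CUDA 9+ these would instead be `__shfl_sync` and `__ballot_sync` with an explicit member mask.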
How can that be? Is there a way to implement such a filter kernel so that it runs faster than the branching one?
UPD: Complete source code can be found in the filter_kernel function in cuda_equation/radiance_cuda.cu at https://bitbucket.org/radiosity/engine/src