How to call device function in CUDA with fewer threads

Question

I want to call an exclusive scan function from inside a kernel that does a radix sort. But the exclusive scan only needs half of the threads to do its work.

The exclusive scan algorithm needs several __syncthreads() in it. If i have a statement at the start like

if(threadIdx.x > NTHREADS/2) return;

these threads will not participate in the exclusive scan syncthreads, which is not allowed. Is there some way around this problem. I do have the call to exclusive scan surrounded by __syncthread()s.

Robert Crovella · Answer 1 · 2013-03-18T23:23:52.047

Something like this should work (don't use the early return):

__syncthreads(); // at entry to exclusive scan region
// begin exclusive scan function
if (threadIdx.x < NTHREADS/2) {
  // do first phase of exclusive scan up to first syncthreads
  }
__syncthreads(); // first syncthreads in exclusive scan function
if (threadIdx.x < NTHREADS/2) {
  // do second phase of exclusive scan up to second syncthreads
  }
__syncthreads(); // second syncthreads in exclusive scan function
(... etc.)
__syncthreads(); // at exit from exclusive scan region

It's somewhat tedious but it's the only way I know of to adhere to the letter of the law on __syncthreads() usage. You can also try just leaving the code the way you indicated, with threads that are doing no work take an early return/exit. It may just work, probably will work. But there's no guarantee that it will work for a future architecture or newer toolchain.

`there's no guarantee that it will work for a future architecture` I can attest to this. A legacy library that I needed to fix was working fine for years on Kepler, Maxwell, Pascal, Turing, then suddenly had deadlock issues on Ampere. The reason was a piece of code that calls `return` in threads that do no work, but later call `__syncthreads()`. We still don't know if it's due to the architecture, or just faster GPU — Huy Le, Nov 16 '22 at 08:29

score 2 · Answer 2 · answered Mar 19 '13 at 03:17

Just to point out an alternative:
You can also use the inline assembly equivalent of __syncthreads(), which allows to use the optional argument for the number of participating threads that is available from compute capability 2.0 onwards. Something like this should work:

#define __syncthreads_active(active_threads) asm volatile("bar.sync 0, %0;" :: "r"(active_threads));

if(threadIdx.x >= NTHREADS/2) return;

int active_warps = (NTHREADS/2 + warpSize) / warpSize;
int active_threads = active_warps * warpSize; // hopefully the compiler will optimize this to a simple active_threads = (NTHREADS/2 + warpSize) & ~32

__syncthreads_active(active_threads);
// do some work...
__syncthreads_active(active_threads);
// do some more work...
__syncthreads_active(active_threads);

DISCLAIMER: Written in the browser and entirely untested!

Whether it's worth the trouble is another question though.

I learned something about inline asm so +1, but `__syncthreads` already and always increments by the number of threads in a warp -- no more, no less -- even if just one thread in a divergent branch hits the `__syncthreads`. This effectively makes it a [per-warp barrier](http://stackoverflow.com/a/30382467/2778484) instruction anyway. — chappjc, May 21 '15 at 19:55

How to call __device__ function in CUDA with fewer threads

2 Answers2

How to call device function in CUDA with fewer threads