
Most reductions I've ever seen look like:

for( i = N; i > 0; i /= 2 ) {
    if( tid < i )
        assign-shared;
    __syncthreads();
}
if( tid == 0 )
    copy-value-to-global;

I've just reversed that to:

for( i = N; i > 0; i /= 2 ) {
    if( tid >= i )
        return;
    assign-shared;
    __syncthreads();
}
copy-value-to-global;

and noticed a substantial performance benefit. Is there any drawback to having the threads that are no longer involved in the reduction return early?

Vitality
William Pursell
    The second code will lead to deadlock. See my question http://stackoverflow.com/questions/6666382/can-i-use-syncthreads-after-having-dropped-threads – user703016 Jul 13 '11 at 08:25

3 Answers


dolan, in his comment above, raises the point that the scheme proposed by William Pursell will deadlock, according to Can I use __syncthreads() after having dropped threads?. On this issue, I would say that, according to conditional syncthreads & deadlock (or not), the code will not deadlock on most GPUs: those GPUs support an early exit because the hardware maintains an active-thread count for each block, and that count, rather than the block's initial thread count, is what is used for barrier synchronization.

I have taken the reduce4 CUDA SDK example and modified it according to the OP's question; that is, I compare the two __global__ functions below:

ORIGINAL

template <class T>
__global__ void reduce4(T *g_idata, T *g_odata, unsigned int N)
{
    extern __shared__ T sdata[];

    unsigned int tid    = threadIdx.x;                              // Local thread index
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;       // Global thread index - Fictitiously double the block dimension

    // --- Performs the first level of reduction in registers when reading from global memory. 
    T mySum = (i < N) ? g_idata[i] : 0;
    if (i + blockDim.x < N) mySum += g_idata[i+blockDim.x];
    sdata[tid] = mySum;

    // --- Before going further, we have to make sure that all the shared memory loads have been completed
    __syncthreads();

    // --- Reduction in shared memory. Only half of the threads contribute to reduction.
    for (unsigned int s=blockDim.x/2; s>32; s>>=1)
    {
        if (tid < s) { sdata[tid] = mySum = mySum + sdata[tid + s]; }
        // --- At the end of each iteration loop, we have to make sure that all memory operations have been completed
        __syncthreads();
    }

    // --- Single warp reduction by loop unrolling. Assuming blockDim.x >64
    if (tid < 32) {
        sdata[tid] = mySum = mySum + sdata[tid + 32]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid + 16]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  8]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  4]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  2]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  1]; __syncthreads();
    }

    // --- Write result for this block to global memory. At the end of the kernel, global memory will contain the results for the summations of
    //     individual blocks
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

MODIFIED

template <class T>
__global__ void reduce4_deadlock_test(T *g_idata, T *g_odata, unsigned int N)
{
    extern __shared__ T sdata[];

    unsigned int tid    = threadIdx.x;                              // Local thread index
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;       // Global thread index - Fictitiously double the block dimension

    // --- Performs the first level of reduction in registers when reading from global memory. 
    T mySum = (i < N) ? g_idata[i] : 0;
    if (i + blockDim.x < N) mySum += g_idata[i+blockDim.x];
    sdata[tid] = mySum;

    // --- Before going further, we have to make sure that all the shared memory loads have been completed
    __syncthreads();

    // --- Reduction in shared memory. Only half of the threads contribute to reduction.
    for (unsigned int s=blockDim.x/2; s>32; s>>=1)
    {
        if (tid >= s) return;
        sdata[tid] = mySum = mySum + sdata[tid + s];
        // --- At the end of each iteration loop, we have to make sure that all memory operations have been completed
        __syncthreads();
    }

    // --- Single warp reduction by loop unrolling. Assuming blockDim.x >64
    if (tid < 32) {
        sdata[tid] = mySum = mySum + sdata[tid + 32]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid + 16]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  8]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  4]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  2]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  1]; __syncthreads();
    }

    // --- Write result for this block to global memory. At the end of the kernel, global memory will contain the results for the summations of
    //     individual blocks
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

I have checked that the modified code does not deadlock on a GT210, a GT540M and a Kepler K20c. However, on the Kepler card the speedup of the modified version is not that significant (times in ms):

N          Original          Modified
131072     0.021             0.019
262144     0.030             0.032
524288     0.052             0.052
1048576    0.091             0.080
2097152    0.165             0.146
4194304    0.323             0.286
8388608    0.637             0.555
16777216   1.264             1.122
33554432   2.514             2.189

I haven't checked the timings on other architectures, but the risk of getting stuck in a deadlock on some GPUs is probably not worth the achievable speedup (provided the achievable speedup stays of the same order of magnitude).
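As an aside (not part of the benchmark above), on CUDA 9 and later the single-warp tail can be written with warp shuffle intrinsics, which sidesteps the whole question of calling __syncthreads() after some threads have exited, since the partial sums travel through registers rather than shared memory. A sketch:

```cuda
// Sketch of a shuffle-based warp reduction (CUDA 9+, full-warp mask).
// Each lane ends up holding the sum of the 32 lane values; no shared
// memory and no __syncthreads() are needed inside the warp.
template <class T>
__inline__ __device__ T warpReduceSum(T val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // lane 0 holds the warp total
}
```

With this tail, the main loop can run down to s > 32 exactly as in the kernels above, and only lane 0 of warp 0 writes the block result.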

Vitality

Since your original code already performs an if check every iteration, I don't see any drawback.

If the results of your if statement did not have spatial locality (that is, generally the same outcome across the block), you might not see any speedup. Also, the speedup may depend on the capabilities of your device: earlier CUDA devices may not show the performance enhancement.

tkerwin

The second code segment performs better because the retired warps no longer have to come back each iteration just to fail the branch check.

Ideally, in the second case you would be retiring one warp per iteration, reducing the load on the GPU.

Pavan Yalamanchili