In his comment above, dolan raises the issue that the scheme proposed by William Pursell will deadlock, according to Can I use __syncthreads() after having dropped threads?. Concerning this point, and according to conditional syncthreads & deadlock (or not), I would say that the code will not deadlock on most GPUs, since they support early exit: on those GPUs the hardware maintains an active thread count for each block, and this count, rather than the block's initial thread count, is used for barrier synchronization.
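To make the pattern under discussion concrete, here is a minimal, hypothetical kernel (not the OP's code and not from the SDK) in which the upper half of the block returns before the remaining threads hit a later __syncthreads(). On GPUs with the active-thread-count behavior described above this does not hang, but the CUDA programming model does not guarantee it:

__global__ void early_exit_barrier(const float *idata, float *odata, unsigned int N)
{
    extern __shared__ float smem[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    smem[tid] = (i < N) ? idata[i] : 0.f;
    __syncthreads();                        // every thread of the block participates here

    if (tid >= blockDim.x / 2) return;      // the upper half of the block exits early

    smem[tid] += smem[tid + blockDim.x / 2];
    __syncthreads();                        // only the surviving threads reach this barrier

    if (tid == 0) odata[blockIdx.x] = smem[0];
}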
I have considered the reduce4 CUDA SDK example and modified it according to the OP's question. Namely, I'm comparing the following two __global__ functions:
ORIGINAL
template <class T>
__global__ void reduce4(T *g_idata, T *g_odata, unsigned int N)
{
    extern __shared__ T sdata[];

    unsigned int tid = threadIdx.x;                             // Local thread index
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;   // Global thread index - Fictitiously double the block dimension

    // --- Performs the first level of reduction in registers when reading from global memory.
    T mySum = (i < N) ? g_idata[i] : 0;
    if (i + blockDim.x < N) mySum += g_idata[i+blockDim.x];
    sdata[tid] = mySum;

    // --- Before going further, we have to make sure that all the shared memory loads have been completed
    __syncthreads();

    // --- Reduction in shared memory. Only half of the threads contribute to reduction.
    for (unsigned int s=blockDim.x/2; s>32; s>>=1)
    {
        if (tid < s) { sdata[tid] = mySum = mySum + sdata[tid + s]; }
        // --- At the end of each iteration loop, we have to make sure that all memory operations have been completed
        __syncthreads();
    }

    // --- Single warp reduction by loop unrolling. Assuming blockDim.x > 64
    if (tid < 32) {
        sdata[tid] = mySum = mySum + sdata[tid + 32]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid + 16]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  8]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  4]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  2]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  1]; __syncthreads();
    }

    // --- Write result for this block to global memory. At the end of the kernel, global memory will contain
    // --- the results for the summations of individual blocks
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
MODIFIED
template <class T>
__global__ void reduce4_deadlock_test(T *g_idata, T *g_odata, unsigned int N)
{
    extern __shared__ T sdata[];

    unsigned int tid = threadIdx.x;                             // Local thread index
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;   // Global thread index - Fictitiously double the block dimension

    // --- Performs the first level of reduction in registers when reading from global memory.
    T mySum = (i < N) ? g_idata[i] : 0;
    if (i + blockDim.x < N) mySum += g_idata[i+blockDim.x];
    sdata[tid] = mySum;

    // --- Before going further, we have to make sure that all the shared memory loads have been completed
    __syncthreads();

    // --- Reduction in shared memory. Only half of the threads contribute to reduction.
    for (unsigned int s=blockDim.x/2; s>32; s>>=1)
    {
        if (tid >= s) return;
        sdata[tid] = mySum = mySum + sdata[tid + s];
        // --- At the end of each iteration loop, we have to make sure that all memory operations have been completed
        __syncthreads();
    }

    // --- Single warp reduction by loop unrolling. Assuming blockDim.x > 64
    if (tid < 32) {
        sdata[tid] = mySum = mySum + sdata[tid + 32]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid + 16]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  8]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  4]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  2]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  1]; __syncthreads();
    }

    // --- Write result for this block to global memory. At the end of the kernel, global memory will contain
    // --- the results for the summations of individual blocks
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
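For completeness, here is a minimal host-side sketch of how such kernels can be launched; the problem size, block size and float instantiation are illustrative assumptions, not taken verbatim from my benchmark code:

// --- Hypothetical launch parameters (assumptions for illustration only)
const unsigned int N         = 1 << 20;
const unsigned int BLOCKSIZE = 256;
// --- Each block loads 2*BLOCKSIZE elements, hence the halved grid size
const unsigned int NUMBLOCKS = (N + 2 * BLOCKSIZE - 1) / (2 * BLOCKSIZE);

float *d_idata, *d_odata;
cudaMalloc(&d_idata, N * sizeof(float));
cudaMalloc(&d_odata, NUMBLOCKS * sizeof(float));

// --- Dynamic shared memory: one element per thread
reduce4<float><<<NUMBLOCKS, BLOCKSIZE, BLOCKSIZE * sizeof(float)>>>(d_idata, d_odata, N);
// --- or, for the modified version
reduce4_deadlock_test<float><<<NUMBLOCKS, BLOCKSIZE, BLOCKSIZE * sizeof(float)>>>(d_idata, d_odata, N);

// --- d_odata now holds one partial sum per block; a further reduction step (a second
// --- kernel call or a copy to the host) is needed to obtain the final single value.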
I have verified that the modified code does not deadlock on a GT210, a GT540M and a Kepler K20c. However, on the Kepler card, the speedup of the modified version is not that significant (times in ms):
       N    Original    Modified
  131072       0.021       0.019
  262144       0.030       0.032
  524288       0.052       0.052
 1048576       0.091       0.080
 2097152       0.165       0.146
 4194304       0.323       0.286
 8388608       0.637       0.555
16777216       1.264       1.122
33554432       2.514       2.189
I haven't checked the timings on other architectures, but the risk of getting stuck in a deadlock on some GPUs is probably not worth the achievable speedup (provided that the speedup remains of the same order of magnitude there).