I'm reading the book "Programming Massively Parallel Processor" (3rd edition) that presents an implementation of the Kogge-Stone parallel scan algorithm. This algorithm is meant to be run by a single block (this is just a preliminary simplification) and what follows is the implementation.
// X is the input array, Y is the output array, InputSize is the size of the input array
__global__ void Kogge_Stone_scan_kernel(float* X, float* Y, int InputSize) {
__shared__ float XY[SECTION_SIZE]; // SECTION_SIZE is basically blockDim.x
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < InputSize)
XY[threadIdx.x] = X[i];
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
__syncthreads();
if (threadIdx.x >= stride)
XY[threadIdx.x] += XY[threadIdx.x - stride]; // Race condition here?
}
Y[i] = XY[threadIdx.x];
}
Regardless of the way the algorithm works, I'm a bit puzzled by the line
XY[threadIdx.x] += XY[threadIdx.x - stride]
. Say stride = 1
, then the thread with threadIdx.x = 6
will perform the operation XY[6] += XY[5]
. However, at the same time the thread with threadIdx.x = 5
will be performing XY[5] += XY[4]
. The question is: is there any guarantee that the thread 6
will read the original value of XY[5]
instead of XY[5] + XY[4]
?. Note that this is not limited to a single warp in which lockstep execution may prevent the race condition.
Thanks