Under what circumstances should you use the volatile keyword with a CUDA kernel's shared memory? I understand that volatile tells the compiler never to cache values in registers, so that every read and write actually goes to memory, but my question is about the behavior with a shared array:
__shared__ float products[THREADS_PER_ACTION];

// some computation
products[threadIdx.x] = localSum;

// wait for everyone to finish their computation
__syncthreads();

// then a (basic, ugly) reduction:
if (threadIdx.x == 0) {
    float globalSum = 0.0f;
    for (int i = 0; i < THREADS_PER_ACTION; i++)
        globalSum += products[i];
}
Do I need products to be volatile in this case? Each array entry is only written by a single thread, except at the end, where everything is read by thread 0. Is it possible that the compiler could cache the entire array in registers, so that I would need volatile, or will it only ever cache individual elements?
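For context, the only place I have seen volatile used with shared memory is in warp-synchronous reductions, something like this (reconstructed from memory from the old CUDA SDK reduction sample, so treat the details as approximate):

// Warp-synchronous reduction over the last 64 elements: there is no
// __syncthreads() between steps, so the volatile qualifier forces each
// store and load to go through shared memory instead of a register.
__device__ void warpReduce(volatile float *sdata, unsigned int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

There the threads read values that other threads in the warp just wrote, with no barrier in between, which seems like a different situation from mine, where __syncthreads() separates the writes from the reads.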
Thanks!