
Under what circumstances should you use the volatile keyword with a CUDA kernel's shared memory? I understand that volatile tells the compiler never to cache any values, but my question is about the behavior with a shared array:

__shared__ float products[THREADS_PER_ACTION];

// some computation
products[threadIdx.x] = localSum;

// wait for everyone to finish their computation
__syncthreads();

// then a (basic, ugly) reduction:
if (threadIdx.x == 0) {
    float globalSum = 0.0f;
    for (int i = 0; i < THREADS_PER_ACTION; i++)
        globalSum += products[i];
}

Do I need products to be volatile in this case? Each array entry is only accessed by a single thread, except at the end, where everything is read by thread 0. Is it possible that the compiler could cache the entire array (so that I would need it to be volatile), or will it only cache individual elements?

Thanks!

Taj Morton

2 Answers


If you don't declare a shared array as volatile, then the compiler is free to optimize locations in shared memory by locating them in registers (whose scope is specific to a single thread), for any thread, at its choosing. This is true whether you access that particular shared element from only one thread or not. Therefore, if you use shared memory as a communication vehicle between threads of a block, it's best to declare it volatile. However, this sort of communication pattern often also requires execution barriers to enforce ordering of reads/writes, so continue reading about barriers below.
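As a rough illustration of that communication pattern (the kernel and buffer names here are made up and a 64-thread block is assumed; this is not code from the question): the shared array is declared volatile and a barrier orders the writes before the cross-thread reads.

__global__ void neighbor_copy(float *out)
{
    // volatile: the compiler may not keep buf's contents in per-thread registers
    volatile __shared__ float buf[64];

    buf[threadIdx.x] = (float)threadIdx.x;   // each thread writes its own slot
    __syncthreads();                         // barrier orders the writes before the reads

    // each thread then reads a slot written by a *different* thread
    out[threadIdx.x] = buf[(threadIdx.x + 1) % 64];
}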

Obviously, if each thread only accesses its own elements of shared memory, and never those associated with another thread, then this does not matter, and the compiler optimization will not break anything.

In your case, where you have a section of code in which each thread accesses its own elements of shared memory, and the only inter-thread access occurs at a well-understood point, you could use a memory fence function to force the compiler to evict any values that are temporarily stored in registers back out to the shared array. So you might think that __threadfence_block() would be useful, but in your case __syncthreads() already has memory-fencing functionality built in. Your __syncthreads() call is therefore sufficient both to force thread synchronization and to force any register-cached values of shared memory to be evicted back to the shared array.

By the way, if that reduction at the end of your code is of performance concern, you could consider using a parallel reduction method to speed it up.
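A minimal sketch of such a tree-style reduction over the products array from the question, assuming THREADS_PER_ACTION is a power of two; it would replace the serial loop in the question and reuses __syncthreads() as both barrier and fence:

// pairwise tree reduction: log2(THREADS_PER_ACTION) steps instead of a serial loop
for (int stride = THREADS_PER_ACTION / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride)
        products[threadIdx.x] += products[threadIdx.x + stride];
    __syncthreads();   // barrier + implicit fence before the next step reads the partial sums
}
if (threadIdx.x == 0) {
    float globalSum = products[0];   // block-wide sum
}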

Robert Crovella

To put it simply for others who come here:

calling __syncthreads() is stronger than declaring shared memory as volatile. __syncthreads() causes all threads from a given work-group (thread block) to stop together at one common point and synchronize memory.

volatile, OTOH, keeps a given memory buffer consistent between threads by preventing the compiler from doing any caching optimizations (so it may come with a cost), but each thread is free to go at its own pace, which enables the compiler/hardware to perform all sorts of scheduling optimizations.
(Note, however, that volatile does not guarantee data integrity if a write consists of more than one processor instruction.)

Summarizing, when all you need is memory consistency between threads, but not stopping all together at one point, then volatile usually provides better performance than __syncthreads(). Your mileage may vary though, depending on the specific algorithm or even the input data, so test both approaches if you need to squeeze out every last bit of performance.

Furthermore, if the number of active threads in a work-group is smaller than the SIMD width (warp size), then volatile may be used instead of __syncthreads(), as all threads in the same warp perform instructions synchronously. See for example the last warp unrolling optimization of the parallel reduction algorithm (slides 21-23), which uses __syncthreads() at first and later relies only on volatile once the number of active threads gets smaller than the warp size.
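A rough sketch of that last-warp unrolling (the function name and sizes are illustrative, adapted from the pattern in the linked slides rather than quoted from them; as the comments below point out, GPUs with Independent Thread Scheduling require __syncwarp() instead of relying on implicit warp-synchronous execution):

// reduces the last 64 partial sums with a single warp; sdata must be volatile
// so each read really goes back to shared memory instead of a stale register
__device__ void warpReduce(volatile float *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

// called from the reduction kernel once only one warp's worth of threads remains:
// if (tid < 32) warpReduce(sdata, tid);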

morgwai
  • That is not true anymore with Dynamic Parallelism. You would have to call `__syncwarp()` instead of relying on implicit warp synchronous execution. – Sebastian Jan 18 '22 at 07:01
  • The question is also whether, without `volatile`, the compiler caches the uninitialized value of `products` in thread 0 and never reads it from shared memory at all. – Sebastian Jan 18 '22 at 07:03
  • @Sebastian which part is not true exactly? Could you please provide some links? This would be very helpful :) Thanks! – morgwai Jan 19 '22 at 08:06
  • "as all threads in the same warp perform instructions synchronously" - Sorry, I meant Independent Thread Scheduling, see for example here https://docs.nvidia.com/cuda/volta-tuning-guide/index.html#sm-independent-thread-scheduling and here https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/ (chapter warp synchronization) – Sebastian Jan 19 '22 at 08:21