
Quoting from the Independent Thread Scheduling section (page 27) of the Volta whitepaper:

Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures

From my understanding, this implies that if there is no divergence within the threads of a warp (i.e., all threads of the warp are active), the threads should execute in lockstep.

Now, consider listing 8 from this blog post, reproduced below:

```cuda
// shmem is a __shared__ int array declared earlier in the kernel
// (omitted in the excerpt)
unsigned tid = threadIdx.x;
int v = 0;

v += shmem[tid+16]; __syncwarp();  // 1
shmem[tid] = v;     __syncwarp();  // 2
v += shmem[tid+8];  __syncwarp();  // 3
shmem[tid] = v;     __syncwarp();  // 4
v += shmem[tid+4];  __syncwarp();  // 5
shmem[tid] = v;     __syncwarp();  // 6
v += shmem[tid+2];  __syncwarp();  // 7
shmem[tid] = v;     __syncwarp();  // 8
v += shmem[tid+1];  __syncwarp();  // 9
shmem[tid] = v;
```

Since we don't have any divergence here, I would expect the threads to already be executing in lockstep without any of the __syncwarp() calls. This seems to contradict the statement I quoted above.

I would appreciate it if someone could clarify this confusion.

apnkpr
  • The second citation from the whitepaper mentioned in [this answer](https://stackoverflow.com/a/58122848/10107454) should answer your question, I think. [This one](https://stackoverflow.com/a/71156298/10107454) also goes deeper into this topic. – paleonix Jun 21 '22 at 11:55
  • One simplistic way to look at it is that before Volta, the underlying hardware was something like "AVX-1024", executing an instruction for the whole warp in each step, with possible masking. The change in Volta allows the hardware designers to, for example, use higher-clocked "AVX-512", where the full warp never executes an instruction at once but instead always runs in half-warps. With this implementation, `__syncwarp()` forces the scheduler to alternate the half-warps, instead of, for example, running the first half-warp to the end before starting the second half-warp. – MadKarel Jun 21 '22 at 13:53

1 Answer


From my understanding, this implies that if there is no divergence within the threads of a warp (i.e., all threads of the warp are active), the threads should execute in lockstep.

If all threads in a warp are active for a particular instruction, then by definition there is no divergence. That has been true since day 1 in CUDA. In my view, it's not logical to connect your statement with the one you excerpted, because the excerpt describes a different case:

Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures

This indicates that the active threads execute in lockstep. Divergence is still possible: the inactive threads (if any) are the ones that have diverged from the active threads. Note that both of these statements describe the CUDA SIMT model; they have been correct and true since day 1 of CUDA and are not specific to the Volta execution model.

For the remainder of your question, I guess instead of this:

I would appreciate it if someone could clarify this confusion.

You are asking:

Why is the syncwarp needed?

Two reasons:

  1. As stated near the top of that post:

Thread synchronization: synchronize threads in a warp and provide a memory fence. __syncwarp

A memory fence is needed in this case to prevent the compiler from "optimizing" shared memory locations into registers.

  2. The CUDA programming model provides no specified order of thread execution. It would be a good idea for you to acknowledge that statement as ground truth. If you write code that requires a specific order of thread execution for correctness, and you don't provide for it explicitly in your source code as a programmer, your code is broken, regardless of how it behaves or what results it produces.
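To make the fence point concrete, here is a sketch of my own (not from the post; the kernel name, array sizes, and launch assumptions are illustrative) showing the kind of transformation the fences rule out:

```cuda
// Sketch: the same reduction steps with the per-step __syncwarp() calls
// removed. The compiler is then free to keep `v` (and even the shmem[tid]
// stores) in registers and to reorder them, so a thread may read a
// neighbor's shmem[tid+offset] before that neighbor's store is visible,
// or re-use a stale value it cached earlier. The fences forbid both.
__global__ void warp_reduce_broken(const int *in, int *out)
{
    __shared__ int shmem[64];       // padded so tid+16 stays in bounds
    unsigned tid = threadIdx.x;     // assume a single warp: blockDim.x == 32
    shmem[tid] = in[tid];
    shmem[tid + 32] = 0;
    __syncwarp();

    int v = shmem[tid];
    v += shmem[tid + 16];   // no fence: may observe a stale or cached value
    shmem[tid] = v;
    v += shmem[tid + 8];    // ...the same hazard at every step
    shmem[tid] = v;
    v += shmem[tid + 4];
    shmem[tid] = v;
    v += shmem[tid + 2];
    shmem[tid] = v;
    v += shmem[tid + 1];
    if (tid == 0) *out = v;
}
```

Pre-Volta code often papered over this by declaring the shared array `volatile`; `__syncwarp()` is the portable way to express the required visibility and ordering.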

The Volta whitepaper describes the behavior of a specific hardware implementation of a CUDA-compliant device. The hardware may guarantee things that the programming model does not.
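As a side note, the linked blog post also covers warp shuffles, which sidestep the issue entirely: `__shfl_down_sync()` exchanges the value and synchronizes the named lanes in one instruction, so no shared memory and no separate `__syncwarp()` calls are needed. A sketch, assuming a full 32-thread warp:

```cuda
// Sum a value across a full warp using shuffles; after the loop,
// lane 0 holds the sum of all 32 lanes' inputs.
__device__ int warp_reduce_sum(int v)
{
    // 0xffffffff: all 32 lanes participate in every exchange
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}
```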

Robert Crovella