
Warps in CUDA always contain 32 threads, and all 32 threads run the same instruction while the warp is executing on an SM. The previous question also says that each thread has its own instruction counter, as quoted below.

Then why does each thread need its own instruction address counter, if all 32 threads always execute the same instruction? Couldn't the threads inside one warp just share a single instruction address counter?

Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data.
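
To make this concrete, here is a minimal sketch of the kind of intra-warp branch I have in mind (my own illustration, not from the book):

```cuda
// Minimal divergence sketch (illustrative only). With a single shared
// program counter, the warp walks through BOTH branch paths in turn,
// masking off the lanes that did not take the path currently running.
__global__ void diverge(int *out)
{
    int lane = threadIdx.x % 32;   // lane index within the warp

    if (lane < 16) {
        out[threadIdx.x] = lane * 2;    // lanes 0-15 active, 16-31 masked
    } else {
        out[threadIdx.x] = lane + 100;  // lanes 16-31 active, 0-15 masked
    }
    // After the if/else, the warp reconverges and all 32 lanes once
    // again execute the same instruction together.
}
```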

Thomson
  • An independent program counter per thread is considered to be a new feature in Volta; see figure 21 and its caption in the [volta whitepaper](https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf): "Volta maintains per-thread scheduling resources such as program counter (PC) and call stack (S), while earlier architectures maintained these resources per warp." The same whitepaper probably does about as good a job as you will find of explaining why this is needed in Volta, and presumably it carries forward to newer architectures such as Turing. – Robert Crovella Sep 24 '19 at 01:31

1 Answer


I'm not able to respond directly to the quoted text, because I don't have the book it comes from, nor do I know the author's intent.

However, an independent program counter per thread is considered to be a new feature in Volta; see figure 21 and its caption in the [Volta whitepaper](https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf):

Volta maintains per-thread scheduling resources such as program counter (PC) and call stack (S), while earlier architectures maintained these resources per warp.

The same whitepaper probably does about as good a job as you will find of explaining why this is needed in Volta, and presumably it carries forward to newer architectures such as Turing:

Volta’s independent thread scheduling allows the GPU to yield execution of any thread, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. To maximize parallel efficiency, Volta includes a schedule optimizer which determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity, while the convergence optimizer in Volta will still group together threads which are executing the same code and run them in parallel for maximum efficiency.

Because of this, a Volta warp could have any number of subgroups of threads (up to the warp size, 32), which could be at different places in the instruction stream. The Volta designers decided that the best way to support this flexibility was to provide (among other things) a separate PC per thread in the warp.
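
As an illustration (a hedged sketch of my own, not a listing from the whitepaper), the practical consequence is that an intra-warp producer/consumer pattern like the following can make forward progress on Volta, whereas with a single per-warp PC the spinning lanes could starve the producing lane:

```cuda
// Hypothetical sketch of an intra-warp producer/consumer handoff.
// With Volta's independent thread scheduling, the spinning lanes can be
// suspended so that lane 0 makes progress; with a single per-warp PC, a
// pattern like this could deadlock if the spin-loop path happened to be
// scheduled before the producing path.
// Assumes *flag is zeroed by the host and a single-warp launch, e.g. <<<1, 32>>>.
__global__ void intra_warp_handoff(volatile int *flag, volatile int *data, int *out)
{
    int lane = threadIdx.x % 32;

    if (lane == 0) {
        *data = 42;        // produce a value...
        __threadfence();   // ...make the store visible...
        *flag = 1;         // ...then raise the flag
    } else {
        while (*flag == 0) { /* spin until lane 0 signals */ }
        out[lane] = *data;
    }
}
```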

Robert Crovella
  • I'm curious about what is meant by **stack** here. – Steven Lu May 11 '20 at 16:35
  • to a first order approximation, it means the same thing as what is described [here](https://en.wikipedia.org/wiki/Call_stack). – Robert Crovella May 11 '20 at 16:45
  • Sorry, I know what a stack is; I meant: does Volta actually allocate a stack (or stack information) in the register file for each thread in a warp, whereas such info prior to Volta was global to a warp? And if so, presumably the compiler would be very conservative about using that functionality... – Steven Lu May 11 '20 at 21:14
  • I wouldn't be able to explain compiler behavior. You can inspect it yourself if you wish. Registers are allocated by the compiler. AFAIK, the stack for a given thread exists in the logical **local** space. Registers are one of the possible physical backings for that space, but not the only possible one. The stack is almost certainly not a register-backed entity, because in the compiled code that I have disassembled and looked at, the stack is often accessed via register-indirect indexing, and registers cannot be accessed via indexing. Ordinary memory can. – Robert Crovella May 11 '20 at 21:53
  • I see, thank you. It’s very useful for support for this kind of thing to exist, as it makes porting code much easier. As always, there’s more to be gained by tuning code to the architecture, or in this case the paradigm. I’d expect that there are very few reasonable situations to use true non-inlined function calls in CUDA, even though it is possible to do so correctly (see the sketch after this thread). – Steven Lu May 12 '20 at 09:14
  • Why did they keep the concept of warps after Volta, then? – Brian Cannard Aug 11 '23 at 20:14
  • Because it is about efficiency and performance. Just because the GPU can execute essentially single-thread work does not mean that doing so is the highest-performance way to get work done on the GPU. Warp-wide activity is still the most efficient execution path. – Robert Crovella Aug 11 '23 at 20:25
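
As a follow-up to the stack discussion in the comments above, here is a hedged sketch (my own construction, not code from any of the commenters) of how to provoke the per-thread stack traffic being described: marking a device function `__noinline__` forces a genuine call, and letting a local array's address escape into that call pushes the array into the logical local space, which you can confirm by disassembling the result:

```cuda
// Hypothetical sketch: forcing a genuine (non-inlined) call so the
// compiler emits a per-thread stack frame in the logical "local" space.
// Compile and inspect, e.g.:
//   nvcc -arch=sm_70 -cubin stack.cu && cuobjdump -sass stack.cubin
__device__ __noinline__ int sum(const int *v, int n)
{
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc += v[i];   // register-indirect loads, as discussed above
    return acc;
}

__global__ void kernel(int *out)
{
    int scratch[8];  // its address escapes into sum(), so it is likely
                     // backed by local memory rather than registers
    for (int i = 0; i < 8; ++i)
        scratch[i] = threadIdx.x + i;

    out[threadIdx.x] = sum(scratch, 8);  // real call -> stack frame
}
```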