23

As far as I know, GPUs switch between warps to hide memory latency. But I wonder: under what conditions will a warp be switched out? For example, if a warp performs a load and the data is already in the cache, is the warp switched out, or does it continue with the next computation? And what happens if there are two consecutive adds? Thanks

Zk1001

1 Answer

33

First of all, once a thread block is launched on a multiprocessor (SM), all of its warps are resident until they all exit the kernel. Thus a block is not launched until there are sufficient registers for all warps of the block, and until there is enough free shared memory for the block.
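Incidentally, these residency constraints are what the occupancy API reports. Here is a minimal sketch, assuming a toolkit new enough to have the occupancy calculator (CUDA 6.5+); the kernel myKernel and its launch parameters are hypothetical:

    #include <cstdio>

    // Hypothetical kernel, used here only to query its occupancy.
    __global__ void myKernel(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;
    }

    int main() {
        int blockSize = 256;        // 8 warps per block
        int maxBlocksPerSM = 0;
        // Asks the runtime how many blocks of this kernel can be resident
        // on one SM, given its register and shared-memory requirements.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &maxBlocksPerSM, myKernel, blockSize, /*dynamicSmemBytes=*/0);
        printf("resident blocks per SM: %d (%d warps)\n",
               maxBlocksPerSM, maxBlocksPerSM * blockSize / 32);
        return 0;
    }

If the kernel uses too many registers or too much shared memory, the reported number drops, because the SM cannot keep all of a block's warps resident at once.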

So warps are never "switched out" -- there is no inter-warp context switching in the traditional sense of the word, where a context switch requires saving registers to memory and restoring them.

The SM does, however, choose which instructions to issue from among all resident warps. In fact, the SM is more likely to issue two instructions in a row from different warps than from the same warp, regardless of the instruction type and of how much ILP (instruction-level parallelism) is available. Issuing back-to-back from one warp would expose the SM to dependency stalls. Even "fast" instructions like adds have non-zero latency, because the arithmetic pipeline is multiple cycles long. On Fermi, for example, the hardware can issue two or more warp-instructions per cycle (peak), while the arithmetic pipeline latency is ~12 cycles. You therefore need multiple warps in flight just to hide arithmetic latency, not just memory latency.
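To see the dependency-stall point in code, consider this sketch (the kernel names and loop counts are made up for illustration). In the first kernel every add depends on the result of the previous one, so a lone warp can issue at most one add per pipeline-latency period; in the second, four independent accumulators give the scheduler up to four ready instructions per warp:

    // Serial dependency chain: each add needs the previous result, so the
    // latency can only be hidden by issuing instructions from other warps.
    __global__ void dependentAdds(float *out, float x) {
        float a = x;
        for (int i = 0; i < 256; ++i)
            a += 1.0f;               // depends on the previous iteration
        out[blockIdx.x * blockDim.x + threadIdx.x] = a;
    }

    // Four independent chains: up to 4-way ILP within a single warp, so the
    // scheduler can keep the arithmetic pipeline fuller per warp.
    __global__ void independentAdds(float *out, float x) {
        float a = x, b = x, c = x, d = x;
        for (int i = 0; i < 64; ++i) {
            a += 1.0f;               // these four adds are mutually
            b += 1.0f;               // independent, so several of them
            c += 1.0f;               // can be in the arithmetic
            d += 1.0f;               // pipeline at the same time
        }
        out[blockIdx.x * blockDim.x + threadIdx.x] = a + b + c + d;
    }

With only a few resident warps, the second version tends to run faster; with many resident warps, inter-warp issue hides the arithmetic latency either way, which is exactly the point above.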

In general, the details of warp scheduling are architecture dependent, not publicly documented, and pretty much guaranteed to change over time. The CUDA programming model is independent of the scheduling algorithm, and you should not rely on it in your software.
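Since the limits themselves differ between architectures, the portable approach is to query them at runtime rather than hardcoding them. A minimal sketch (the maxBlocksPerMultiProcessor field requires a recent toolkit; on older ones, drop that line):

    #include <cstdio>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // device 0
        printf("warp size:                 %d\n", prop.warpSize);
        printf("max resident threads / SM: %d\n", prop.maxThreadsPerMultiProcessor);
        printf("max resident blocks / SM:  %d\n", prop.maxBlocksPerMultiProcessor);
        return 0;
    }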

harrism
  • Thanks for the answer. I agree with most of it, but a few things need discussion. First, by "switching" here I did not mean the typical switching with register saving. What I meant is: after a warp requests something from memory that is not in the cache, it is stalled, or switched out, whatever you call it (but still active on the SM), and another warp comes in and occupies the SM. So what I want to know is whether there is another situation where a warp occupies the SM for more than one cycle (for example, performing an add instruction until it finishes). – Zk1001 Jul 07 '11 at 05:51
  • If that situation doesn't happen, then I deduce the following: assuming that in one cycle an SM can service 10 warps (10 warps are active), and the kernel also has 10 warps on each SM, then the scheduler switches 10 times to perform one instruction. So if the kernel has 5 instructions, will the hardware switch 50 times? – Zk1001 Jul 07 '11 at 05:56
  • 4
    Don't think about "switching. Think about *issuing*. The SM has a pool of resident warps from which it can issue instructions. Which warp(s) it issues from at any given cycle is irrelevant, as long as it is always issuing instructions. – harrism Jul 08 '11 at 00:18
  • 2
    Don't think about it as switching; it is *issuing*. The SM has a pool of resident warps from which it can issue instructions. It is not important which warp it issues from at any given cycle; what matters is that it always has instructions that can be issued. Does whether or not an SM might issue two instructions in a row from the same warp affect how you program CUDA? Nope. – harrism Jul 08 '11 at 00:20
  • Thank you harrism. Your explanation is exact, I think. There should really be more detail about this in the CUDA manuals; it would make it easier to optimize programs. – Zk1001 Jul 08 '11 at 07:19
  • Your answer is the opposite of what I understand from the nVidia technical papers, which claim that each shader unit has a number of warps that run 32 threads each, and keeps 1024 threads "in flight" _which are switched in and out_ by the hardware whenever a thread is stalled on memory latency to keep the ALU as busy as possible. – Damon Jul 09 '11 at 09:29
  • 1
    Damon, let me explain my understanding. I think those ideas are basically not _opposed_ at all. Switching a warp out simply means the SM doesn't issue the next instruction of that warp, and that warp is stalled until all of its threads finish the current instruction. While a warp is switched out, I think some of its threads can still occupy some of the execution units due to instruction pipelining. What they really mean by switching in and out here is just whether an instruction of a warp has started to execute in lockstep or not. – Zk1001 Jul 10 '11 at 07:11
  • 4
    "in flight", means resident on the SM. As long as they are resident, they are not "switched" -- their register set, shared memory, program counter, etc. are all maintained. The "switching" is just choosing instructions to issue from the set of resident warps. I'm trying to prevent confusion with traditional CPU thread context "switching", where to switch among executing threads requires saving and restoring allocated register values, program counter, etc. to off-chip memory (or cache) and is therefore a much more heavyweight operation. – harrism Jul 11 '11 at 01:41
  • Is there still no inter-warp context switching in the Ampere and Turing microarchitectures? Does a block still not exit from an SM until it finishes computation in Ampere and Turing? – Virux Sep 13 '22 at 02:42
  • See the question in the next comment. – Walid Hanafy Dec 02 '22 at 03:50