How is a warp formed and handled by the hardware warp scheduler?

Question

My questions are about warps and scheduling. I'm using NVIDIA Fermi terminology here. My observations are below, are they correct?

A. Threads in the same warp execute the same instruction. Each warp includes 32 threads.

According to the Fermi Whitepaper: "Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. "

From here, I think a warp(32 threads) is scheduled twice since 16 cores out of 32 are grouped together. Each scheduler issues half of a warp to 16 cores in a cycle, and in all, two schedulers issue two warp-halves into two 16-core scheduling groups in a cycle. In another words, one warp needs to be scheduled twice, half by half, in this Fermi architecture. If a warp contains only SFU operations, then this warp needs to be issued 8 times(32/4), since there's only 4 SFPUs in an SM.

B. When a large amount of threads (say 1-D array, 320 threads) is launched, consecutive threads will be grouped into 10 warps automatically, each has 32 threads. Therefore, if all threads are doing the same work, they will execute exactly the same instruction. Then all warps are always carrying the same instruction in this case.

Questions: Q1. Which part handles the threads grouping (into warps)? software or hardware? if hardware, is it the warp scheduler? and how the hardware warp scheduler is implemented and work?

Q2. If I have 64 threads, threads 0-15 and 32-47 are executing the same instruction while 16-31 and 48-63 executes another instruction, is the scheduler smart enough to group nonconsecutive threads( with the same instruction) into the same warp (i.e., to group threads 0-15 and 32-47 into the same warp, and to group threads 16-31 and 48-63 into another warp)?

Q3. What's the point to have a warp size(32) larger than the scheduling group size(16 cores)?(this is a hardware question) Since in this case(Fermi), a warp will be scheduled twice (in two cycles) anyway. If a warp is 16 wide, simply two warps will be scheduled (also in two cycles), which seems the same with the previous case.I wonder whether this organization is due to performance concern.

What I can imagine now is: threads in the same warp can be guaranteed synchronized which can be useful sometimes, or other resources such as registers and memory are organized in the warp size basis. I'm not sure whether this is correct.

A number of your statements (e.g. 1. above) don't look correct to me. SIMD width is not really a CUDA concept, AFAIK. You might be interested in the [relevant section](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation) of the programming guide. — Robert Crovella, Feb 03 '14 at 19:29
SP = streaming processor = 16 cores, and an SM(streaming microprocessor) contains two SPs, i.e., two groups of 16 cores = 32 cores. Thanks for your reminder, I've reorganized the whole thing into Fermi terminology. — Blair, Feb 03 '14 at 21:36

Robert Crovella · Accepted Answer · 2014-02-03T23:06:37.503

Correcting some misconceptions:

A. ...From here, I think a warp(32 threads) is scheduled twice since 16 cores out of 32 are grouped together.

When the warp instruction is issued to a group of 16 cores, the entire warp executes the instruction, because the cores are clocked twice (Fermi's "hotclock") so that each core actually executes two thread's worth of computation in a single cycle (= 2 hotclocks). When a warp instruction is dispatched, the entire warp gets serviced. It does not need to be scheduled twice.

B. ...Therefore, if all threads are doing the same work, they will execute exactly the same instruction. Then all warps are always carrying the same instruction in this case.

It's true that all threads in a block (and therefore all warps) are executing from the same instruction stream, but they are not necessarily executing the same instruction. Certainly all threads in a warp are executing the same instruction at any given time. But warps execute independently from each other and so different warps within a block may be executing different instructions from the stream, at any given time. The diagram on page 10 of the Fermi whitepaper makes this evident.

Q1: Which part handles the threads grouping (into warps)? software or hardware?

It is done by hardware, as explained in the hardware implementation section of the programming guide: "The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block. "

and how the hardware warp scheduler is implemented and work?

I don't believe this is formally documented anywhere. Greg Smith has provided various explanations about it, and you may wish to seach on "user:124092 scheduler" or a similar search, to read some of his comments.

Q2. If I have 64 threads, threads 0-15 and 32-47 are executing the same instruction while 16-31 and 48-63 executes another instruction, is the scheduler smart enough to group nonconsecutive threads( with the same instruction) into the same warp (i.e., to group threads 0-15 and 32-47 into the same warp, and to group threads 16-31 and 48-63 into another warp)?

This question is predicated on misconceptions outlined earlier. The grouping of threads into a warp is not dynamic; it is fixed at threadblock launch time, and it follows the methodology described above in the answer to Q1. Furthermore, threads 0-15 will never be scheduled with any threads other than 16-31, as 0-31 comprise a warp, which is indivisible for scheduling purposes, on Fermi.

Q3. What's the point to have a warp size(32) larger than the scheduling group size(16 cores)?

Again, I believe this question is predicated on previous misconceptions. The hardware units used to provide resources for a warp may exist in 16 units (or some other number) at some functional level, but from an operational level, the warp is scheduled as 32 threads, and each instruction is scheduled for the entire warp, and executed together, within some number of Fermi hotclocks.

Thanks very much for your reply. Now I realize the two 16-core groups are double clocked, and then everything makes sense. It seems "SP" is not a widely accepted terminology. I got to use it from Figure1(a) in this MICRO 2013 paper, Warped gates: gating aware scheduling and power gating for GPGPUs. http://dl.acm.org/citation.cfm?id=2540719 — Blair, Feb 03 '14 at 22:41
The gist of the double-clocking is covered in that paper in the paragraph preceding section 2.2 The use of "SP" to refer to a "shader processor" as used in that paper may be correct, but I don't think it has widespread applicability in GPU computing. I guess there are multiple definitions of "SP". — Robert Crovella, Feb 03 '14 at 23:05

score 1 · Answer 2 · edited May 23 '17 at 10:28

As far as I know:

Q1 - scheduling is done at hardware level, warps are the scheduling units and warps, their lanes constituents (a laneid is the hardware equivalent of the thread index in a warp), SMs and other components at this level are all hardware units which are abstracted and programmed via the CUDA programming model.

Q2 - It also depends on the grid: if you're launching two blocks containing a single thread each, you will end up with two warps each of which contains only one active thread. As I said all scheduling and execution is done on a warp-basis and more warps the hardware has, the more it can schedule (although they may contain dummy NOP threads) and try to hide latency/less instruction pipeline stalls.

Q3 - Once resources are allocated threads are always divided into 32-thread warps. On Fermi warp schedulers pick two warp per cycle and dispatch them to execution units. On pre-Fermi architectures SMs had fewer than 32 thread processors. Right now Fermi has 32 thread processors. However, a full memory request can only retrieve 128 bytes at a time. Therefore, for data sizes larger than 32 bits per thread per transaction, the memory controller may still break the request down into a half-warp size (https://stackoverflow.com/a/14927626/1938163). Besides

The SM schedules threads in groups of 32 parallel threads called warps. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs.

you don't have a "scheduling group size" at thread-level as you wrote, but if you re-read the above statement you'll have that 16 cores (or 16 load/store units or 4 SFUs) are readied with one instruction from a 32-thread warp each. If you were asking "why 16?" well.. that's another architectural story... and I suspect it's a carefully designed tradeoff. I'm sorry but I don't know more about this.

So I guess this warp size is pretty much a fixed hardware architectural parameter, related to many other architectural parameters, such as the memory bandwidth, number of execution units and registers' width. Thanks very much for your reply! — Blair, Feb 03 '14 at 22:48

How is a warp formed and handled by the hardware warp scheduler?

2 Answers2

Linked