CUDA Kepler: not enough ALUs

Question

According to the Kepler whitepaper, the warp size for a Kepler-based GPU is 32, and each multiprocessor contains 4 warp schedulers, each of which selects two independent instructions from a chosen warp. This means that in each clock cycle 32*4*2 = 256 calculations are to be performed, but a multiprocessor only contains 192 ALUs. How are these calculations performed then?

By definition, warps run the same instruction, so there are *4 to 6* instructions per SMX per cycle. But I don't see how this is a programming question, so I have voted to close it. — talonmies, May 28 '14 at 16:19
Okay, I misused the term "instruction", the question has been updated accordingly. — PieterV, May 28 '14 at 16:23
Also you haven't read correctly. The documentation and whitepaper says "up to two instructions per warp", not "two instructions per warp". — talonmies, May 28 '14 at 16:26
Okay, but still, what if 2 instructions are selected? It might not always happen, but it is possible. — PieterV, May 28 '14 at 16:27

score 2 · Accepted Answer · answered May 28 '14 at 16:27

The actual whitepaper wording is as follows:

The SMX schedules threads in groups of 32 parallel threads called warps. Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and executed concurrently. Kepler’s quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle.

The interpretation is that in any given cycle, at most 4 warps can be scheduled. For each of those 4 warps, (up to) 2 independent instructions per warp can be dispatched. "can be dispatched" is not the same as "will be dispatched".

The 192 ALUs you are referring to are the single precision floating point units (SP units for the purpose of this discussion). However, there are other functional units in the SM(X), such as double precision floating point units (DP units), load/store units (LD/ST units), and others. Refer to the diagram on page 8 of the whitepaper linked above. If every instruction in a given set used the SP units, then 8 instructions could not be scheduled; at most 6 (32 x 6 = 192) could be. However, if the instruction mix contains independent instructions of different types (e.g. loads, stores, SP ops, etc.), then the limit of 192 SP units will not necessarily be the determining factor in how many instructions actually get scheduled in any given cycle.
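To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted in this thread (warp size 32, 4 schedulers, up to 2 dispatches each, 192 SP units). The function names are made up for illustration, and the model assumes one warp instruction occupies 32 SP lanes for a cycle, which is a simplification rather than a statement about how the hardware actually pipelines work.

```python
# Figures quoted in the question and answer (Kepler SMX).
WARP_SIZE = 32               # threads per warp
SCHEDULERS_PER_SMX = 4       # warp schedulers per SMX
DISPATCH_PER_SCHEDULER = 2   # "up to" two instructions per selected warp
SP_UNITS_PER_SMX = 192       # single precision units per SMX


def max_issue_slots():
    """Upper bound on warp instructions issued per cycle per SMX."""
    return SCHEDULERS_PER_SMX * DISPATCH_PER_SCHEDULER


def lanes_requested_at_peak():
    """Thread-level operations implied if every issue slot were filled."""
    return WARP_SIZE * max_issue_slots()


def sp_only_issue_limit():
    """Warp instructions per cycle if every instruction needs SP units.

    Assumption (illustrative only): each warp instruction occupies
    WARP_SIZE SP lanes for one cycle.
    """
    unit_limit = SP_UNITS_PER_SMX // WARP_SIZE
    return min(max_issue_slots(), unit_limit)


print("issue slots per cycle:", max_issue_slots())
print("lanes requested at peak:", lanes_requested_at_peak())
print("SP-only warp instructions per cycle:", sp_only_issue_limit())
```

Under these assumptions the 256 figure in the question only arises if all 8 issue slots are filled, and an all-SP instruction stream is capped at 6 warp instructions per cycle by the 192 SP units. The remaining slots can only be used by instructions that target other functional units.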

The bottom line is that 8 instructions (2 inst/scheduler x 4 schedulers) per cycle is the maximum possible instruction issue rate per SM(X). Real world codes do not necessarily achieve this. It's entirely possible that in a given cycle no instructions could get issued, due to stall/starvation conditions.
