According to Volkov's talk *Better Performance at Lower Occupancy*, ILP is an important way to hide latency and thereby increase throughput. However, also according to the talk, each SM has only two warp schedulers, which (if I understand correctly) means an SM can issue two independent instructions from one thread. Why, then, does throughput keep increasing with ILP > 2 (according to the experiment on pp. 15-20 of Volkov's talk)?
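For concreteness, here is a hypothetical sketch (not Volkov's actual benchmark code) of the kind of kernel variation I mean: `ilp1` has a single dependent FMA chain per thread, while `ilp2` keeps two independent chains, so the second instruction of each iteration does not have to wait for the result of the first.

```cuda
// Hypothetical kernels illustrating ILP=1 vs. ILP=2 per thread
// (a sketch in the spirit of Volkov's experiment, not his actual code).
__global__ void ilp1(float *out, float a, float b, int iters)
{
    float x = threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x * a + b;                       // each FMA depends on the previous one
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

__global__ void ilp2(float *out, float a, float b, int iters)
{
    float x = threadIdx.x;
    float y = threadIdx.x + 1.0f;
    for (int i = 0; i < iters; ++i) {
        x = x * a + b;                       // chain 1
        y = y * a + b;                       // chain 2, independent of chain 1
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x + y;
}
```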
-
The question seems overly broad. Yes, ILP helps somewhat, and there are more opportunities to exploit ILP in newer GPU architectures (roughly Pascal and later). – njuffa Jul 27 '21 at 07:11
-
ILP is also exploited by the SM's execution pipelines. Code with ILP can also be reordered at compile time to reduce instruction dependencies. I do not think the schedulers are blocked while instructions wait for completion; they can issue independent instructions from the same warp in the following cycle. – Sebastian Jul 27 '21 at 07:53
-
So with k warp schedulers, can we reach more than k times the throughput just by using ILP (with more than k consecutive independent instructions)? – FlyingPig Jul 27 '21 at 08:24
-
Some warp schedulers are dual-issue, some are not; a precise answer to your question requires specifying which. For k single-issue warp schedulers, it's not possible to issue more than k instructions in a particular clock cycle. For k dual-issue warp schedulers (e.g., Kepler) it is possible to issue more than k instructions in a particular clock cycle. Note that warp schedulers have assigned warps. This means that two separate warp schedulers cannot issue instructions pertaining to the same warp in a particular clock cycle (or ever, for recent GPUs). – Robert Crovella Jul 27 '21 at 15:23
-
The property of *independence* amongst instructions (which is related to ILP) in a particular instruction stream is a generally useful property. As mentioned already, the compiler may identify independence and use that to [reorder instructions](https://stackoverflow.com/questions/43832429/is-starting-1-thread-per-element-always-optimal-for-data-independent-problems-on/43833050#43833050), which can lead to performance improvement. This doesn't necessarily have anything to do with issue characteristics. – Robert Crovella Jul 27 '21 at 15:27
-
We have to distinguish between how many instructions can be issued per cycle and how many instructions can be in flight, executing over several cycles (e.g. special math functions or memory accesses). The first is limited by the dual-issue schedulers, the second by the available pipelines. – Sebastian Jul 28 '21 at 03:49
1 Answer
Is ILP (instruction-level parallelism) helpful for GPU program optimization?
Yes, for several reasons. One reason is that adjacent instructions in an instruction stream that exhibit ILP are, by definition, independent: they do not depend on each other's results, so their relative execution order does not affect correctness. This is a valuable feature in an instruction stream because the stream keeps supplying issue-able instructions before a stall is encountered. A common reason for an instruction stall is an unsatisfied dependency; independence between instructions in a stream means that this particular stall reason cannot apply to them. The compiler is aware of this, of course, and will attempt to create groups of independent instructions, for example by loop unrolling. Another potential reason applies when running on a GPU architecture (e.g. Kepler) whose warp schedulers are dual-issue capable. In that case, to achieve the maximum issue rate in any given clock cycle, it's necessary to have instructions adjacent to each other in the instruction stream that are independently issue-able.
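As a hypothetical illustration of that first point (the kernel and its names are my own sketch, not from the talk), the kernel below keeps four independent partial sums. After unrolling, the compiler can schedule the four loads and FMAs of each iteration back to back instead of stalling on a single dependent accumulator:

```cuda
// Sketch of exposing independence to the compiler via multiple accumulators
// (hypothetical kernel; remainder handling omitted for brevity).
__global__ void dot_partial(const float *a, const float *b, float *out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;    // independent accumulators
    for (int i = tid; i + 3 * stride < n; i += 4 * stride) {
        s0 += a[i             ] * b[i             ];
        s1 += a[i +     stride] * b[i +     stride];
        s2 += a[i + 2 * stride] * b[i + 2 * stride];
        s3 += a[i + 3 * stride] * b[i + 3 * stride];
    }
    out[tid] = s0 + s1 + s2 + s3;                     // combine once at the end
}
```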
each SM only has two warp schedulers
This isn't true for all GPU architectures; the number of warp schedulers per SM varies by architecture. The issue characteristics of a warp scheduler also vary by architecture: some are single-issue capable, some are dual-issue capable.
So with k warp schedulers, can we reach more than k times throughput just using ILP (with more than k independent instructions consecutively)?
Yes, if comparing against code that does not have consecutive independent instructions, for the reason already stated: independence allows the maximum issue rate without stalls due to dependences among instructions, which is a valuable characteristic for throughput.

In recent GPU architectures, warps are "statically" assigned to warp schedulers when the threadblock is deposited on the SM by the CUDA Work Distributor. Having k warp schedulers does not mean that you need k consecutive independent instructions in order for all warp schedulers to be able to issue in a particular clock cycle. If we imagine the case where all warp schedulers happen to be working in lockstep (there is no reason to assume this in general), then in a given clock cycle each warp scheduler wants to issue the same instruction, albeit for a different warp. Without additional information (such as the availability of execution resources for that instruction type), we would assume that instruction may be issue-able across multiple warp schedulers. The point is that multiple warp schedulers can be engaged and issuing instructions even when there is no ILP in the instruction stream; it is not necessary to have an ILP of 4 for all 4 warp schedulers to be able to issue in the same cycle.

However, ILP is needed when a warp scheduler wants to dual-issue (that requires a minimum ILP of 2 at some point in the instruction stream). And quite apart from the warp schedulers, ILP is valuable because it implies the possibility of longer sequences of instructions that are issue-able without stalls. That is valuable for any GPU, independent of its particular SM configuration.
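As a rough, self-contained sketch of the point that schedulers can stay busy without any ILP (hypothetical kernel and launch configurations, not taken from the talk or the answer), the program below keeps ILP at 1 in every thread (one dependent FMA chain) and varies only the number of resident warps. If wall time stays roughly flat while the thread count grows, achieved throughput scaled purely through thread-level parallelism:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One dependent FMA chain per thread: ILP = 1.
__global__ void fma_chain(float *out, float a, float b, int iters)
{
    float x = threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x * a + b;                       // each FMA waits on the previous one
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

// Time one launch with CUDA events (error checking omitted for brevity).
static float time_launch(int blocks, int threads, float *d_out, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    fma_chain<<<blocks, threads>>>(d_out, 1.0001f, 0.5f, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int iters = 1 << 20;
    float *d_out;
    cudaMalloc(&d_out, 1024 * 1024 * sizeof(float));

    // Warm-up launch so the first measurement excludes one-time setup costs.
    fma_chain<<<32, 64>>>(d_out, 1.0001f, 0.5f, iters);
    cudaDeviceSynchronize();

    // Same per-thread work, increasing thread-level parallelism per SM.
    printf("  64 threads/block: %.3f ms\n", time_launch(32,   64, d_out, iters));
    printf(" 256 threads/block: %.3f ms\n", time_launch(32,  256, d_out, iters));
    printf("1024 threads/block: %.3f ms\n", time_launch(32, 1024, d_out, iters));

    cudaFree(d_out);
    return 0;
}
```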
