In Section 9.5.3 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual, the effects of hardware prefetching are described as follows:
The effective latency reduction for several microarchitecture implementations is shown in Figure 9-2. For a constant-stride access pattern, the benefit of the automatic hardware prefetcher begins at half the trigger threshold distance and reaches maximum benefit when the cache-miss stride is 64 bytes.
- Family 6 models 13 and 14 are Pentium M (Dothan and Yonah, respectively), from 2004 and 2006. (https://en.wikichip.org/wiki/intel/cpuid)
- Family 15 is NetBurst (Pentium 4); models 0, 1, and 2 are the early generations, Willamette and Northwood.
- Family 15 models 3 and 4 are Prescott, and model 6 is a successor to that.
Pentium 4 used 128-byte lines in its L2 cache (or pairs of 64-byte lines kept together), vs. 64-byte lines in its L1d cache.
Pentium M used 64-byte cache lines at all levels, up from 32-byte in Pentium III.
I have two questions:
- How should the "Effective Latency Reduction" in the figure be interpreted? Taken literally, it should be (latency without prefetch − latency with prefetch) / latency without prefetch, so a higher value would be better. However, the figure seems to treat lower values as better, contrary to that reading.
- How long is the trigger threshold distance? Section 9.5.2 says the prefetcher "will attempt to prefetch two cache lines ahead of the prefetch stream", which is 64 B × 2 = 128 B for the LLC. However, the significant inflection point in the figure occurs around 132 B; if that point is "half the trigger threshold distance", the threshold itself would have to be 132 B × 2 = 264 B.
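To make the two readings above concrete, here is a small sketch. The latency numbers are purely hypothetical (not from the manual), and the assumption that the trigger threshold distance equals the two-lines-ahead prefetch distance is exactly the assumption being questioned:

```python
def effective_latency_reduction(lat_no_prefetch, lat_with_prefetch):
    """Literal reading of the metric: fraction of miss latency removed,
    so higher should mean better."""
    return (lat_no_prefetch - lat_with_prefetch) / lat_no_prefetch

# Hypothetical example: prefetching cuts a 300-cycle miss to 120 cycles.
print(effective_latency_reduction(300, 120))  # 0.6, i.e. 60% of the latency hidden

# Arithmetic from the second question, assuming the trigger threshold
# distance is the "two cache lines ahead" distance of Section 9.5.2:
line_size = 64                         # bytes, L2 line size on Pentium M
trigger_threshold = 2 * line_size      # 128 B
print(trigger_threshold // 2)          # 64 B -- where the benefit should begin
# The figure's inflection near 132 B would instead imply a threshold of
# roughly 2 * 132 = 264 B, which is the discrepancy being asked about.
```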
I read the surrounding context but did not find an explanation of either term.