In Section 9.5.3 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual, the effects of hardware prefetching are described as follows:
The effective latency reduction for several microarchitecture implementations is shown in Figure 9-2. For a constant-stride access pattern, the benefit of the automatic hardware prefetcher begins at half the trigger threshold distance and reaches maximum benefit when the cache-miss stride is 64 bytes.
- Family 6 models 13 and 14 are Pentium M (Dothan and Yonah, respectively), from 2004 and 2006. (https://en.wikichip.org/wiki/intel/cpuid)
- Family 15 is NetBurst (Pentium 4); models 0, 1, and 2 are the early generations, Willamette and Northwood.
- Family 15 models 3 and 4 are Prescott, and model 6 is a successor to that.
Pentium 4 used 128-byte lines in its L2 cache (or pairs of 64-byte lines kept together), vs. 64-byte lines in its L1d cache.
Pentium M used 64-byte cache lines at all levels, up from 32-byte in Pentium III.
I have two questions:
- How should the "Effective Latency Reduction" in the figure be interpreted? Taken literally, it should be (latency without prefetch − latency with prefetch) / latency without prefetch, so a higher value would be better. However, the figure seems to treat lower values as better, contrary to that reading.
- How long is the trigger threshold distance? Section 9.5.2 says the prefetcher "will attempt to prefetch two cache lines ahead of the prefetch stream", which is 64 B × 2 = 128 B for the LLC. However, the significant inflection point in the figure occurs around 132 B; if that point is "half the trigger threshold distance", the threshold itself would have to be 132 B × 2 = 264 B.
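To make the two readings above concrete, here is a small sketch. The latency numbers are purely hypothetical (not from the manual), and the assumption that the trigger threshold distance equals the two-lines-ahead prefetch distance is exactly the assumption being questioned:

```python
def effective_latency_reduction(lat_no_prefetch, lat_with_prefetch):
    """Literal reading of the metric: fraction of miss latency removed,
    so higher should mean better."""
    return (lat_no_prefetch - lat_with_prefetch) / lat_no_prefetch

# Hypothetical example: prefetching cuts a 300-cycle miss to 120 cycles.
print(effective_latency_reduction(300, 120))  # 0.6, i.e. 60% of the latency hidden

# Arithmetic from the second question, assuming the trigger threshold
# distance is the "two cache lines ahead" distance of Section 9.5.2:
line_size = 64                         # bytes, L2 line size on Pentium M
trigger_threshold = 2 * line_size      # 128 B
print(trigger_threshold // 2)          # 64 B -- where the benefit should begin
# The figure's inflection near 132 B would instead imply a threshold of
# roughly 2 * 132 = 264 B, which is the discrepancy being asked about.
```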
I read the surrounding context but did not find an explanation of either term.