
I found descriptions of a speculative data caching behavior in multiple instruction entries in Intel SDM Vol. 2.

For example, from the LFENCE entry:

Processors are free to fetch and cache data speculatively from regions of system memory that use the WB, WC, and WT memory types. This speculative fetching can occur at any time and is not tied to instruction execution. Thus, it is not ordered with respect to executions of the LFENCE instruction; data can be brought into the caches speculatively just before, during, or after the execution of an LFENCE instruction.

Also, I found from online resources that speculative caching can also move data from a farther cache level to a closer one.

I want to know whether the strongest serializing instruction CPUID will prevent speculative caching across the barrier.

I've already searched the CPUID entry in Intel Vol. 2 and the "serializing instructions" section in Intel Vol. 3, but neither says anything about speculative data caching.

user10865622

1 Answer


LFENCE is already strong enough (in practice at least) to stop the CPU from actually looking at load instructions after it, but the CPU is free to speculatively load for other reasons.

Stopping that would require some kind of lookahead past the barrier to find out what addresses to disable HW prefetch for. That's not practical at all. CPUID or other serializing instructions aren't any stronger than LFENCE for stopping load prefetches.

The CPU is always allowed to speculatively fetch from memory in WB and WT regions / pages. Intel's optimization manual documents some stuff about the hardware prefetchers in some of their CPU models, so you could in practice avoid doing things before CPUID that are likely to trigger such prefetches.

(WC is weakly-ordered uncacheable+write-combining, but speculative fetch is also allowed there on paper. In real life that probably only happens in the shadow of a branch mispredict, not HW prefetch. It's not normally cacheable like WB and WT.)


If you're microbenchmarking a real CPU, the trick to some kinds of microbenchmarks is to find an access pattern that won't trigger HW prefetching, or to disable the HW prefetchers.

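On the "disable the HW prefetchers" option: on many Intel cores the four documented prefetchers can be toggled via MSR 0x1A4, per Intel's "Disclosure of H/W Prefetcher Control" note. A hedged sketch using the standard msr-tools package; it needs root, the msr kernel module, and a CPU model actually covered by that note:

```shell
# Sketch: toggling Intel's documented HW prefetchers via MSR 0x1A4.
# Only valid on CPU models covered by Intel's prefetcher-control disclosure.
sudo modprobe msr          # expose /dev/cpu/*/msr for msr-tools
# Bits in MSR 0x1A4: 0 = L2 HW prefetcher, 1 = L2 adjacent-line prefetcher,
# 2 = DCU (L1d) streaming prefetcher, 3 = DCU IP-stride prefetcher.
sudo wrmsr -a 0x1a4 0xf    # set all four bits: all prefetchers off, all cores
# ... run the microbenchmark here ...
sudo wrmsr -a 0x1a4 0x0    # re-enable everything afterwards
```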

Maybe in theory you could have an x86 CPU that looked ahead in the instruction stream for load/store instructions and speculatively prefetched for them, separate from actually executing them (which Intel's definition of LFENCE would block). I don't think anything would stop it from doing that across CPUID either.

Probably nobody will design such a CPU, because

  1. It's not worth the transistors / power. Starting prefetch as soon as regular out-of-order execution can get to it is already good enough. And except for absolute / RIP-relative addresses or direct jumps, you'd need register values from the OoO core to get a useful prefetch address.
  2. Looking past LFENCE / CPUID is perverse; they're rare enough that defeating speculative "execution" of loads past them is part of the point, in the age of Spectre.
Peter Cordes
  • After more thinking about it, I think the HW prefetcher is sort of an independent unit running in parallel with the instruction flow. The instruction flow gives it hints; the prefetcher then fetches data into the cache independently. Am I correct? – user10865622 Jan 15 '19 at 08:48
  • In my mental model, `lfence` is more of an "instruction fence" than a "load fence", so I compare it with `cpuid` and find it curious that the quoted paragraph is absent from `cpuid`'s entry. Maybe the intent of that paragraph is to address the "load fence" part of this instruction, because loading into cache is sort of a "load"; therefore the paragraph isn't included in `cpuid`. – user10865622 Jan 15 '19 at 08:51
  • @user10865622: yup, and shared caches (shared between multiple cores) can have prefetchers, too. I forget if Intel or AMD's current CPUs have L3 prefetching. Most of the PF logic is in the private per-core L2 on Intel, with some in L1d and L1i. As for why CPUID doesn't have similar language, yeah probably because the name of the instruction implies memory ordering. (But yes, it's nearly useless for memory ordering, only instruction ordering. The only memory order use-case for `lfence` is I think ordering NT loads from WC memory: as a LoadLoad and LoadStore barrier, but not StoreLoad.) – Peter Cordes Jan 15 '19 at 08:55
  • Right, all of the data prefetchers only prefetch from WB locations (not even WT). See 2.4.5.4 of the optimization manual: "Load is from writeback memory type." That's in addition to the facts that the prefetchers don't cross 4KB page boundaries and that memory types are per-page. – Hadi Brais Jul 10 '19 at 10:03