Is a port blocked while data is being fetched from cache or memory in a CPU microarchitecture?

Question

Intel Skylake cores have two identical memory read ports (ports 2 and 3) and one write port (port 4). Assume two load instructions are issued to ports 2 and 3 in parallel:

When both loads can be served from L1 cache (about 10 ns), will ports 2 and 3 be blocked until the data is fetched and the load instructions retire?
What if the data is not in cache and must be fetched from memory? Will the load ports be blocked for a long time?
Another guess: while data is being fetched from cache or memory, the request is held in the load buffer in the MOB and the port is released for the next load. Does that mean a port can serve multiple loads simultaneously while their data is on its way from cache/memory to the core?

Pointers to supporting material would be appreciated. I googled but found no answer.

score 1 · Accepted Answer · answered Jan 20 '23 at 05:06

The load execution units are fully pipelined, sustaining 2 loads per clock on cache hits. See https://agner.org/optimize/ and https://uops.info/, and note the experimental test results verifying sustained 2/clock execution of load uops.

Try it yourself with a loop like this in a static executable: run it under perf stat ./a.out and note that the loop runs at 1 cycle per iteration (2 loads).

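 ; build (file name is just an example): nasm -felf64 loads.asm && ld -o a.out loads.o
 global _start
_start:
 mov rdi, rsp            ; any valid address that stays hot in L1d
 mov edx, 1000000000     ; iteration count
.loop:
  mov eax, [rdi]         ; two independent loads per iteration,
  mov ecx, [rdi+4]       ; one for each load port
  dec edx
  jnz .loop

 xor edi, edi            ; exit status 0
 mov eax, 231            ; __NR_exit_group
 syscall                 ; Linux exit_group(edi)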

Also see Intel's optimization manual, where you can see that Skylake's sustained L1d bandwidth is over 80 bytes per cycle (from 2 loads and 1 store per cycle of 32-byte vectors). Apparently something sometimes prevents sustaining the full 2 loads + 1 store per clock, at least with vectors that wide, but it definitely doesn't stall.
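To make the bandwidth figures concrete, here is a minimal sketch of the peak number that "over 80" is being compared against. The function name is hypothetical, and it simply assumes every load and store port moves one full vector each cycle.

```c
/* Peak L1d bytes per cycle, assuming each load port and each store port
   transfers one full vector every cycle (an idealized upper bound). */
unsigned peak_l1d_bytes_per_cycle(unsigned load_ports,
                                  unsigned store_ports,
                                  unsigned vector_bytes)
{
    return (load_ports + store_ports) * vector_bytes;
}
```

With Skylake's 2 load ports, 1 store port and 32-byte vectors, peak_l1d_bytes_per_cycle(2, 1, 32) is 96 (64 read + 32 written), so a sustained rate of over 80 bytes per cycle is somewhat short of the theoretical peak.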

An L1d cache miss doesn't stall the port either; load uops can keep executing until you run out of LFBs (line fill buffers) and stall. But even with the LFBs all waiting for incoming cache lines, loads that hit in L1d cache can still execute. Also, loads that load from the same cache line as another outstanding load can pile on to the same LFB. (Or you might also run out of load buffers, which would stop the alloc/rename stage from issuing more load uops into the back end.)
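As a rough sketch of why the number of LFBs matters, Little's law gives an upper bound on miss throughput: the number of lines in flight divided by the time each one takes. The function name is hypothetical, and the example numbers (12 fill buffers, 64-byte lines, 80 ns miss latency) are illustrative assumptions, not figures from this answer.

```c
/* Upper bound on bandwidth from cache misses for one core, by Little's law:
   throughput = (outstanding lines * bytes per line) / latency per miss.
   Bytes per nanosecond is numerically the same as GB/s (10^9 bytes/s). */
double max_miss_bandwidth_bytes_per_ns(unsigned fill_buffers,
                                       unsigned line_bytes,
                                       double miss_latency_ns)
{
    return (double)fill_buffers * (double)line_bytes / miss_latency_ns;
}
```

With those assumed values, 12 * 64 / 80 works out to 9.6 bytes per ns, i.e. about 9.6 GB/s. The point is that misses overlap up to the buffer limit, not that these exact numbers apply to any particular chip.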

Also, L1d cache hit latency is 5 cycles on modern Intel; that's just over 1 ns, not 10! https://www.7-cpu.com/cpu/Skylake.html
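The cycles-to-time conversion behind that is simple enough to sketch. The function name is hypothetical, and the 4 GHz clock used in the note below is an assumption for illustration; plug in your own core clock.

```c
/* Convert a latency in core clock cycles to nanoseconds.
   clock_ghz is the core frequency in GHz, i.e. cycles per nanosecond. */
double cycles_to_ns(double cycles, double clock_ghz)
{
    return cycles / clock_ghz;
}
```

At an assumed 4 GHz, cycles_to_ns(5, 4.0) is 1.25 ns. Going the other way, 10 ns at that clock would be 40 cycles.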

See also https://www.realworldtech.com/haswell-cpu/.

Also https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ re: cache misses eventually stalling.

IIRC Skylake can get 96 bytes of total bandwidth per cycle (64 R + 32 W) with AVX2 stores as long as you disable prefetching. I don't recall the details, but I guess it would be the L1 prefetches implicated here? It would affect all access sizes, but the effect is exacerbated at the larger sizes because prefetches occur more frequently as the time to consume each cache line is shorter. - BeeOnRope, Jan 23 '23 at 02:25
