Questions tagged [micro-architecture]
107 questions
5
votes
0 answers
Is prefetch useless if it doesn't complete before load?
Let's say we have this pseudo code, where ptr is not in any CPU cache:
prefetch_to_L1 ptr
/* 20 cycles */
load ptr
Since ptr is in main memory, the latency of the prefetch operation (from prefetch instruction decoding to ptr being available in L1…

Elliot Gorokhovsky
- 3,610
- 2
- 31
- 56
5
votes
1 answer
Way prediction in modern cache
We know that direct-mapped caches are better than set-associative caches in terms of cache hit time, as there is no search involved for a particular tag. On the other hand, set-associative caches usually show a better hit rate than direct-mapped…

jhagk
- 111
- 1
- 9
5
votes
0 answers
Why is this code not hitting the micro-op cache on Haswell when changing a single instruction?
I'm trying to understand the behavior of the uop cache (DSB in Intel docs) on my Haswell chip. I'm relying on the Intel optimization manual and Agner Fog's PDFs.
I've found a set of cases where the frontend reliably falls back to the MITE…

carnaval
- 51
- 4
5
votes
2 answers
Are load ops deallocated from the RS when they dispatch, complete or some other time?
On modern Intel¹ x86, are load uops freed from the RS (Reservation Station) at the point they dispatch², or when they complete³, or somewhere in-between⁴?
¹ I am also interested in AMD Zen and sequels, so feel free to include that too, but for the…

BeeOnRope
- 60,350
- 16
- 207
- 386
5
votes
1 answer
How many ways-superscalar are modern Intel processors?
I just learned about superscalar processors (https://en.wikipedia.org/wiki/Superscalar_processor).
I also learned that as superscalar processors increase in width (number of ways), things get more complicated, and complexity increases so fast that…

Cedar
- 748
- 6
- 21
5
votes
1 answer
How is the transitivity/cumulativity property of memory barriers implemented micro-architecturally?
I've been reading about how the x86 memory model works and the significance of the barrier instructions on x86, and comparing it to other architectures such as ARMv8. In both the x86 and ARMv8 architectures, it appears (no pun intended) that the memory…

Raghu
- 479
- 3
- 13
5
votes
2 answers
Why jnz requires 2 cycles to complete in an inner loop
I'm on an Ivy Bridge. I found the performance behavior of jnz inconsistent between the inner loop and the outer loop.
The following simple program has an inner loop with fixed size 16:
global _start
_start:
mov rcx, 100000000
.loop_outer:
mov rax, …

user10865622
- 455
- 3
- 11
5
votes
0 answers
Why is an (NVIDIA) GPU L1 cache line longer than an L2 cache line?
In NVIDIA Fermi and Kepler GPUs (probably Maxwell too), an L1 cache line is 128 bytes long, while an L2 cache line is 32 bytes long. Shouldn't that be the other way around? I mean, L1 is much smaller, shouldn't it try to cache shorter segments of…

einpoklum
- 118,144
- 57
- 340
- 684
4
votes
1 answer
Temporality of ST64B and MOVDIR64B
x86_64 has an instruction movdir64b, which to my understanding is a non-temporal copy (well, at least the store is) of 64 bytes (a cache line). AArch64 seems to have a similar instruction st64b, which does an atomic store of the same size. …

Mona the Monad
- 2,265
- 3
- 19
- 30
4
votes
2 answers
Why does a loop transitioning from having its uops fed by the Uop Cache to LSD cause a spike in branch-misses?
All benchmarks are run on either Ice Lake or Whiskey Lake (in the Skylake family).
Summary
I am seeing a strange phenomenon where it appears that when a loop transitions from running out of the Uop Cache to running out of the LSD (Loop Stream Detector)…

Noah
- 1,647
- 1
- 9
- 18
4
votes
1 answer
Intel JCC Erratum - should JCC really be treated separately?
Intel pushed a microcode update to fix an erratum called the "Jump Conditional Code (JCC) Erratum". The updated microcode made some operations less efficient by disabling the placement of code in the ICache under certain conditions.
The published document, titled…

Alex Guteniev
- 12,039
- 2
- 34
- 79
4
votes
1 answer
How much is known publicly about the details of how Apple processors work internally?
Edit: in an attempt to avoid this question being closed as a reference request (though I still would appreciate references!), I will give a few general, non-link-only questions for concreteness. I would accept an answer for any of these, but the…

Brennan Vincent
- 10,736
- 9
- 32
- 54
4
votes
2 answers
About the RIDL vulnerabilities and the "replaying" of loads
I'm trying to understand the RIDL class of vulnerability.
This is a class of vulnerabilities that is able to read stale data from various micro-architectural buffers.
Today the known vulnerabilities exploit the LFBs, the load ports, the eMC and…

Margaret Bloom
- 41,768
- 5
- 78
- 124
4
votes
3 answers
Conditional jump instructions in MSROM procedures?
This relates to this question
Thinking about it though, on a modern Intel CPU the SEC phase is implemented in microcode, meaning there would be a check whereby a burned-in key is used to verify the signature on the PEI ACM. If it doesn't match, then…

Lewis Kelsey
- 4,129
- 1
- 32
- 42
4
votes
1 answer
In x86 Intel VT-X non-root mode, can an interrupt be delivered at every instruction boundary?
Other than certain normally specified conditions where interrupts are not delivered to the virtual processor (cli, IF=0, etc.), are all instructions in the guest actually interruptible?
That is to say, when an incoming hardware interrupt is given to…

Gbps
- 857
- 2
- 14
- 29