Highest Voted 'micro-architecture' Questions

5 votes, 0 answers

Is prefetch useless if it doesn't complete before load?

Let's say we have this pseudocode, where ptr is not in any CPU cache: prefetch_to_L1 ptr /* 20 cycles */ load ptr. Since ptr is in main memory, the latency of the prefetch operation (from prefetch instruction decoding to ptr being available in L1…

asked Feb 19 '22 at 15:16

Elliot Gorokhovsky
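The pseudocode in the prefetch question can be written out in C as below. This is a minimal sketch, assuming GCC or Clang, which provide the __builtin_prefetch hint; prefetch_then_load and filler_iters are hypothetical names, and the busy loop is only a rough stand-in for the "20 cycles" of independent work, not a cycle-accurate delay. It reproduces the instruction pattern the question asks about but does not measure whether the prefetch helps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: prefetch *ptr, do some independent work, then issue
   the demand load. __builtin_prefetch (GCC/Clang) is only a hint: it does
   not fault and gives no signal of when, or whether, the line reaches L1.
   filler_iters is a hypothetical stand-in for the "20 cycles" in the
   pseudocode; loop iterations are not cycles. */
static long prefetch_then_load(const long *ptr, long filler_iters) {
    assert(ptr != NULL);
    __builtin_prefetch(ptr, 0 /* read */, 3 /* high temporal locality */);

    /* Independent work that does not depend on *ptr. */
    volatile long sink = 0;
    for (long i = 0; i < filler_iters; i++)
        sink += i;

    /* Demand load. Whether it still benefits from an in-flight prefetch
       is hardware-dependent; this code does not measure it. */
    return *ptr;
}
```

Timing the load with and without the prefetch (for example with a cycle counter and a buffer larger than the last-level cache) would be needed to answer the question; the sketch only fixes the code shape.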

5 votes, 1 answer

Way prediction in modern caches

We know that direct-mapped caches are better than set-associative caches in terms of hit time, as no search among multiple tags is involved. On the other hand, set-associative caches usually show a better hit rate than direct-mapped…

caching cpu-architecture processor cpu-cache micro-architecture

asked Oct 03 '20 at 10:52

jhagk
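For the way-prediction question, a toy model of the address split and tag lookup may help. It assumes a hypothetical 32 KiB cache with 64-byte lines: with 1 way (direct-mapped) each address has exactly one candidate line and one tag compare, while with 8 ways each set holds 8 candidate tags. probe_with_way_prediction is a hypothetical cost model in which checking the predicted way first costs one step and a wrong guess costs one extra step; real designs differ, and valid bits are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cache geometry used only for illustration. */
struct cache_geom {
    uint32_t size_bytes; /* total capacity */
    uint32_t line_bytes; /* bytes per line (power of two) */
    uint32_t ways;       /* associativity; 1 = direct-mapped */
};

/* Number of sets = capacity / (line size * ways). */
static uint32_t num_sets(struct cache_geom g) {
    assert(g.line_bytes != 0 && g.ways != 0);
    uint32_t sets = g.size_bytes / (g.line_bytes * g.ways);
    assert(sets != 0);
    return sets;
}

/* The set index selects which set (a single line if direct-mapped) to check. */
static uint64_t set_index(struct cache_geom g, uint64_t addr) {
    return (addr / g.line_bytes) % num_sets(g);
}

/* The remaining high-order bits form the tag compared against each way. */
static uint64_t tag_of(struct cache_geom g, uint64_t addr) {
    return (addr / g.line_bytes) / num_sets(g);
}

/* Toy way-prediction lookup within one set (valid bits ignored).
   Returns 1 if the predicted way holds the tag, 2 if a different way does
   (hypothetical one-step penalty for a wrong guess), 0 on a miss. */
static int probe_with_way_prediction(const uint64_t *tags, uint32_t ways,
                                     uint32_t predicted, uint64_t tag) {
    assert(predicted < ways);
    if (tags[predicted] == tag)
        return 1;
    for (uint32_t w = 0; w < ways; w++)
        if (w != predicted && tags[w] == tag)
            return 2;
    return 0;
}
```

For example, address 0x12345 maps to set 141 with tag 2 in the direct-mapped configuration and to set 13 with tag 18 in the 8-way configuration.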

5 votes, 0 answers

Why is this code not hitting the micro-op cache on Haswell when changing a single instruction?

I'm trying to understand the behavior of the uop cache (the DSB in Intel's docs) on my Haswell chip. I'm relying on the Intel optimization manual and Agner Fog's PDFs. I've found a set of cases where the front end reliably falls back to the MITE…

assembly x86 cpu-architecture micro-optimization micro-architecture

asked May 04 '20 at 03:42

carnaval

5 votes, 2 answers

Are load ops deallocated from the RS when they dispatch, complete or some other time?

On modern Intel¹ x86, are load uops freed from the RS (Reservation Station) at the point they dispatch², or when they complete³, or somewhere in-between⁴? ¹ I am also interested in AMD Zen and sequels, so feel free to include that too, but for the…

x86 intel cpu-architecture micro-architecture

asked Jan 25 '20 at 00:46

BeeOnRope

5 votes, 1 answer

How many ways-superscalar are modern Intel processors?

I just learned about superscalar processors (https://en.wikipedia.org/wiki/Superscalar_processor). I also learned that as superscalar processors increase in width (number of ways), they become more complicated, and complexity increases so fast that…

x86 intel cpu-architecture micro-architecture

asked Oct 16 '19 at 16:49

Cedar

5 votes, 1 answer

How is the transitivity/cumulativity property of memory barriers implemented micro-architecturally?

I've been reading about how the x86 memory model works and the significance of the barrier instructions on x86, and comparing it to other architectures such as ARMv8. In both the x86 and ARMv8 architectures, it appears (no pun intended) that the memory…

x86 x86-64 cpu-architecture memory-barriers micro-architecture

asked Sep 19 '19 at 20:23

Raghu

5 votes, 2 answers

Why does jnz require 2 cycles to complete in an inner loop?

I'm on an Ivy Bridge. I found that the performance of jnz is inconsistent between the inner loop and the outer loop. The following simple program has an inner loop with a fixed size of 16: global _start _start: mov rcx, 100000000 .loop_outer: mov rax, …

x86 micro-optimization microbenchmark micro-architecture

asked Jan 12 '19 at 03:17

user10865622

5 votes, 0 answers

Why is an (NVIDIA) GPU L1 cache line longer than an L2 cache line?

In NVIDIA Fermi and Kepler GPUs (probably Maxwell too), an L1 cache line is 128 bytes long, while an L2 cache line is 32 bytes long. Shouldn't that be the other way around? I mean, L1 is much smaller; shouldn't it try to cache shorter segments of…

caching gpgpu cpu-cache micro-architecture

asked May 24 '15 at 06:29

einpoklum
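The line sizes quoted in the GPU question can be compared with a small counting sketch. It assumes a 32-thread warp issuing 4-byte loads (an assumption, not something stated in the question) and counts how many distinct aligned 128-byte or 32-byte blocks the warp touches; it does not model how Fermi or Kepler actually decide what to cache in L1. distinct_blocks is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts the distinct aligned blocks of block_bytes
   touched by the n byte addresses in addrs. Quadratic, which is fine for
   the 32 addresses of one warp. */
static size_t distinct_blocks(const uint64_t *addrs, size_t n,
                              uint64_t block_bytes) {
    assert(block_bytes != 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t block = addrs[i] / block_bytes;
        int seen = 0;
        /* Was this block already touched by an earlier address? */
        for (size_t j = 0; j < i; j++) {
            if (addrs[j] / block_bytes == block) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            count++;
    }
    return count;
}
```

With consecutive 4-byte accesses the warp touches one 128-byte block, or four 32-byte blocks covering the same 128 bytes. With a 128-byte stride it touches 32 blocks at either granularity, so fetching whole 128-byte blocks moves 4096 bytes versus 1024 bytes with 32-byte blocks.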

4 votes, 1 answer

Temporality of ST64B and MOVDIR64B

x86_64 has an instruction movdir64b, which to my understanding is a non-temporal copy (well, at least the store is) of 64 bytes (a cache line). AArch64 seems to have a similar instruction st64b, which does an atomic store of the same size. …

assembly x86-64 cpu-architecture arm64 micro-architecture

asked Jan 03 '22 at 03:43

Mona the Monad

4 votes, 2 answers

Why does a loop transitioning from having its uops fed by the Uop Cache to LSD cause a spike in branch-misses?

All benchmarks are run on either Ice Lake or Whiskey Lake (Skylake family). Summary: I am seeing a strange phenomenon where it appears that when a loop transitions from running out of the Uop Cache to running out of the LSD (Loop Stream Detector)…

x86-64 cpu-architecture micro-optimization branch-prediction micro-architecture

asked Apr 14 '21 at 20:53

Noah

4 votes, 1 answer

Intel JCC Erratum - should JCC really be treated separately?

Intel pushed a microcode update to fix an erratum called the "Jump Conditional Code (JCC) Erratum". The updated microcode makes some code less efficient because, under certain conditions, it prevents that code from being cached in the Decoded ICache. The published document, titled…

assembly x86 intel cpu-architecture micro-architecture

asked Jun 10 '20 at 14:25

Alex Guteniev

4 votes, 1 answer

How much is known publicly about the details of how Apple processors work internally?

Edit: in an attempt to avoid this question being closed as a reference request (though I still would appreciate references!), I will give a few general, non-link-only questions for concreteness. I would accept an answer for any of these, but the…

iphone cpu-architecture arm64 micro-optimization micro-architecture

asked May 31 '19 at 02:48

Brennan Vincent

4 votes, 2 answers

About the RIDL vulnerabilities and the "replaying" of loads

I'm trying to understand the RIDL class of vulnerabilities. This is a class of vulnerabilities that can read stale data from various micro-architectural buffers. The currently known vulnerabilities exploit the LFBs, the load ports, the eMC and…

x86 cpu cpu-architecture micro-architecture cpu-mds

asked May 17 '19 at 13:19

Margaret Bloom

4 votes, 3 answers

Conditional jump instructions in MSROM procedures?

This relates to this question. Thinking about it, though: on a modern Intel CPU the SEC phase is implemented in microcode, meaning there would be a check whereby a burned-in key is used to verify the signature on the PEI ACM. If it doesn't match then…

x86 intel cpu-architecture branch-prediction micro-architecture

asked Apr 23 '19 at 14:04

Lewis Kelsey

4 votes, 1 answer

In x86 Intel VT-X non-root mode, can an interrupt be delivered at every instruction boundary?

Other than certain normal specified conditions where interrupts are not delivered to the virtual processor (cli, IF=0, etc.), are all instructions in the guest actually interruptible? That is to say, when an incoming hardware interrupt is given to…

x86 intel interrupt cpu-architecture micro-architecture

asked Feb 22 '19 at 06:46

Gbps
