Questions tagged [micro-architecture]
107 questions
4
votes
1 answer
Does mov r64, m64 have one-cycle or two-cycle latency?
I'm on Ivy Bridge, and I wrote the following simple program to measure the latency of mov:
section .bss
align 64
buf: resb 64
section .text
global _start
_start:
    mov rcx, 1000000000
    xor rax, rax
loop:
    mov rax, [buf+rax]
…

user10865622
- 455
- 3
- 11
4
votes
1 answer
When making a read request to DRAM, why do we need to read both tag and data, not data only?
I am going through David Patterson and John Hennessy's computer architecture book. In Chapter 2, it mentions that we may need to make two separate requests to read the tag and the data in two cycles if we store tags in DRAM. My question is why do we need to…

Shibo Chen
- 77
- 6
4
votes
3 answers
How do modern x86 processors actually compute multiplications?
I was watching a lecture on algorithms, and the professor used multiplication as an example of how naive algorithms can be improved...
It made me realize that multiplication is not that obvious, although when I am coding I just consider it a…

speeder
- 6,197
- 5
- 34
- 51
3
votes
0 answers
Intel Alder Lake performance degradation after spin wait
I'm tuning my program for low latency.
I have a tight calculation function, calc();, which uses SIMD floating-point instructions heavily.
I tested the performance of calc(); using the perf command. It shows that this calc function is using ~10k…

VariantF
- 41
- 1
- 5
3
votes
0 answers
Handling x86-64 microarchitecture levels in Debian package names
I'm planning to build different versions of an intensive numerical program for x86-64 architectures. Conveniently, in 2020, 4 levels of x86-64 microarchitecture were defined that can be passed to the compiler via the "-march" flag.
Thus, for GCC 11 (and…

Justin JRTI
- 56
- 4
3
votes
1 answer
How do I get the CPU information for my computer, i.e. functional units, latencies, etc.?
I'm trying to learn assembly, and in the book I'm reading I came across functional units and their latencies shown in tables.
I was wondering what the functional units of my CPU are and what their latencies are.
integer addition,…

Megan Darcy
- 530
- 5
- 15
3
votes
1 answer
Execute operations of the same instruction separately in an OoO processor
Imagine that we have an instruction which has been divided into 3 micro-operations, and we have an out-of-order processor. My question is: must these 3 uops be executed sequentially, or can the processor interleave these uops with other uops from…

isma
- 143
- 1
- 6
3
votes
0 answers
Does the store buffer hold physical or virtual addresses on modern x86?
Modern Intel and AMD chips have large store buffers to hold stores before they commit to the L1 cache. Conceptually, these entries hold the store data and store address.
For the address part, do these buffer entries hold virtual or physical addresses,…

BeeOnRope
- 60,350
- 16
- 207
- 386
3
votes
0 answers
How are micro-ops arranged in the Instruction Decode Queue (IDQ)?
Something I've been wondering for a while, but firstly, one assumption to make is that all μops produced by a macro-op could have the same rip as the macro-op (I'm pretty sure that the IQ would have a rip for each IFETCH block and the decoders could…

Lewis Kelsey
- 4,129
- 1
- 32
- 42
3
votes
1 answer
Will CPUID serialize speculative data caching?
I found a description of speculative data caching in multiple instruction entries in Intel Vol. 2.
For example, the lfence:
Processors are free to fetch and cache data speculatively from regions
of system memory that use the WB,…

user10865622
- 455
- 3
- 11
3
votes
1 answer
Why can't a dependency in a loop iteration be executed together with the previous one?
I use this code to test the impact of a dependency in a loop iteration on Ivy Bridge:
global _start
_start:
    mov rcx, 1000000000
.for_loop:
    inc rax         ; uop A
    inc rax         ; uop B
    dec rcx         ; uop C
    jnz .for_loop
…

user10865622
- 455
- 3
- 11
3
votes
1 answer
Architecture and microarchitecture
Can someone broadly explain to me the difference between a processor's architecture and its microarchitecture, as well as the relation between them?
One should be related to its functional parts, but I do not see what the other relates to.

Philippe
- 700
- 1
- 7
- 17
3
votes
0 answers
Large run-to-run variance shown by a copy loop implemented with MOVDQU
I am seeking an explanation for results that I am seeing in a loop that moves 64 bytes per iteration, from some source memory location to some destination memory location, using the x86 movdqu instruction (the movdqu instruction supports moving 16-byte…

Karthik M
- 78
- 7
2
votes
0 answers
Why does FADDP D-form have higher throughput than FADDP Q-form on the Cortex-A72?
I've been operating on a rough rule of thumb that Q-form ASIMD instructions are as good or better than D-form if you've got enough data to operate on. I was therefore surprised to see when reading §3.15 of the Cortex-A72 Software Optimization Guide…

Steve Cox
- 1,947
- 13
- 13
2
votes
1 answer
How does the load-store queue work in the presence of MSHRs?
I understand the basic working of the load-store queue, which is:
when loads compute their address, they check the store queue for any prior stores to the same address, and if there is one, they get the data from the most recent store, else from…

Nebula
- 31
- 1