Questions tagged [micro-architecture]

107 questions
4 votes
1 answer

Does mov r64, m64 have one-cycle or two-cycle latency?

I'm on Ivy Bridge and wrote the following simple program to measure the latency of mov:

    section .bss
    align 64
    buf: resb 64
    section .text
    global _start
    _start:
        mov rcx, 1000000000
        xor rax, rax
    loop:
        mov rax, [buf+rax]
    …
4 votes
1 answer

When making a read request to DRAM, why do we need to read both tag and data, not data only?

I am going through David Patterson and John Hennessy's computer architecture book. Chapter 2 mentions that we may need to make two separate requests to read tag and data in two cycles if we store tags in DRAM. My question is why do we need to…
4 votes
3 answers

How do modern x86 processors actually compute multiplications?

I was watching some lecture on algorithms, and the professor used multiplication as an example of how naive algorithms can be improved... It made me realize that multiplication is not that obvious, although when I am coding I just consider it a…
speeder
3 votes
0 answers

Intel Alder Lake performance degradation after spin wait

I'm tuning my program for low latency. I have a tight calculation function, calc();, which uses SIMD floating-point instructions heavily. I tested the performance of calc(); using the perf command. It shows that this calc function is using ~10k…
VariantF
3 votes
0 answers

handling x86-64 microarchitecture levels in Debian package names

I'm planning to build different versions of a numerically intensive program for x86-64 architectures. Conveniently, in 2020, four levels of x86-64 microarchitecture were defined that can be passed to the compiler via the "-march" flag. Thus, for GCC 11 (and…
3 votes
1 answer

How do I get the CPU information for my computer, i.e. functional units, latencies, etc.?

I'm trying to learn assembly, and in the book I'm reading I came across functional units and their latencies, shown in tables. I was wondering what the functional units of my CPU are and what their latencies are: integer addition,…
Megan Darcy
3 votes
1 answer

Execute operations of the same instruction separately in an OoO processor

Imagine that we have an instruction that is decoded into 3 micro-operations, and we have an out-of-order processor. My question is: must these 3 uops be executed sequentially, or can the processor interleave these uops with other uops from…
3 votes
0 answers

Does the store buffer hold physical or virtual addresses on modern x86?

Modern Intel and AMD chips have large store buffers that hold stores before they commit to the L1 cache. Conceptually, these entries hold the store data and store address. For the address part, do these buffer entries hold virtual or physical addresses,…
BeeOnRope
3 votes
0 answers

How are micro-ops arranged in the Instruction Decode Queue (IDQ)?

Something I've been wondering for a while, but firstly, one assumption to make is that all μops produced by a macro-op could have the same rip as the macro-op (I'm pretty sure that the IQ would have a rip for each IFETCH block and the decoders could…
Lewis Kelsey
3 votes
1 answer

Will CPUID serialize speculative data caching?

I found a description of speculative data caching in multiple instruction entries in Intel Vol. 2. For example, under lfence: Processors are free to fetch and cache data speculatively from regions of system memory that use the WB,…
3 votes
1 answer

Why can't a dependency in a loop iteration be executed together with the previous one?

I use this code to test the impact of a dependency in a loop iteration on Ivy Bridge:

    global _start
    _start:
        mov rcx, 1000000000
    .for_loop:
        inc rax ; uop A
        inc rax ; uop B
        dec rcx ; uop C
        jnz .for_loop
    …
3 votes
1 answer

Architecture and microarchitecture

Can someone broadly explain the difference between a processor's architecture and its microarchitecture, as well as the relation between them? One seems to be related to its functioning parts, but I do not see what the other refers to.
Philippe
3 votes
0 answers

Large run-to-run variance shown by a copy-loop implemented with MOVDQU

I am seeking an explanation for results that I am seeing in a loop that moves 64 bytes per iteration from a source memory location to a destination memory location, using the x86 movdqu instruction (the movdqu instruction supports moving of 16byte…
2 votes
0 answers

Why does FADDP D-form have higher throughput than FADDP Q-form on the Cortex-A72

I've been operating on a rough rule of thumb that Q-form ASIMD instructions are as good or better than D-form if you've got enough data to operate on. I was therefore surprised to see when reading §3.15 of the Cortex-A72 Software Optimization Guide…
Steve Cox
2 votes
1 answer

How does Load Store Queue work in the presence of MSHR?

I understand the basic working of the load-store queue: when loads compute their address, they check the store queue for any prior stores to the same address, and if there is one they get the data from the most recent store, else from…