Why do mov reg,reg instructions reading the result of a load account for so many cycles with perf record?

Question

I'm profiling my program on Linux with the perf tool, and one spot in the report really confuses me. A few lines of the report are attached below:

  0.94 :          451ab5:       mov    (%r15),%r8
  0.44 :          451ab8:       mov    0x40(%rsp),%r15
  0.45 :          451abd:       mov    (%rsi),%rsi
  0.14 :          451ac0:       mov    (%r8,%rdi,4),%edi
  5.41 :          451ac4:       prefetcht0 (%rsi)
  0.11 :          451ac7:       lea    (%r15,%rdi,4),%rdi
  0.34 :          451acb:       mov    (%rdi),%r8d
  5.62 :          451ace:       add    %r8d,%eax
  0.18 :          451ad1:       prefetchnta (%rbx,%r8,4)
 24.46 :          451ad6:       mov    %r8,%r11
  0.11 :          451ad9:       mov    %eax,(%rdi)
  0.05 :          451adb:       mov    0x4(%rdx),%eax
  0.02 :          451ade:       lea    0x0(,%rax,4),%edi

What confuses me is why this line (mov %r8,%r11) costs so much time. As I understand it, this instruction only copies the data in register %r8 into %r11. The data in %r8 is loaded by the instruction at 451acb.

My guess is that the load instruction (mov (%rdi),%r8d) only triggers the read and doesn't actually "block"; a later instruction that needs the contents of register r8 then "blocks" until the data has been loaded into the CPU cache.

My question: is my guess correct?

CPU : Intel E5-2660 v4

I'm assuming this is on an Intel CPU (not AMD or VIA)? I tagged intel-pmu because it's the hardware that chooses how to account cycles to instructions. (Hard problem for superscalar out-of-order execution). Anyway yes, it's normal that instructions consuming a load result get the blame if they have to wait, not the load itself. — Peter Cordes, Aug 02 '19 at 02:59
