I'm profiling my program on Linux with the perf tool, and while checking the report I found something that really confuses me. A few lines of the report are attached below:
0.94 : 451ab5: mov (%r15),%r8
0.44 : 451ab8: mov 0x40(%rsp),%r15
0.45 : 451abd: mov (%rsi),%rsi
0.14 : 451ac0: mov (%r8,%rdi,4),%edi
5.41 : 451ac4: prefetcht0 (%rsi)
0.11 : 451ac7: lea (%r15,%rdi,4),%rdi
0.34 : 451acb: mov (%rdi),%r8d
5.62 : 451ace: add %r8d,%eax
0.18 : 451ad1: prefetchnta (%rbx,%r8,4)
24.46 : 451ad6: mov %r8,%r11
0.11 : 451ad9: mov %eax,(%rdi)
0.05 : 451adb: mov 0x4(%rdx),%eax
0.02 : 451ade: lea 0x0(,%rax,4),%edi
What confuses me is why this line (mov %r8,%r11) costs so much time. As far as I understand, this instruction only moves data from register %r8 into %r11. The data in %r8 is loaded at address 451acb.
My guess is that the instruction mov (%rdi),%r8d only triggers a read and does not actually "block"; when a later instruction needs the content of register %r8, that instruction "blocks" until the content has been loaded into the CPU cache.
Is my guess correct?
CPU: Intel Xeon E5-2660 v4
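For anyone who wants to reproduce the pattern, here is a minimal sketch in C of a loop with the same shape: an indexed load followed immediately by a use of the loaded value. It is not my real code; the names gather_sum and fill_indices are hypothetical, and the index generator is just a simple LCG chosen for illustration. The idea is to check in the annotated output whether the samples land on the load itself or on a later instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill idx with pseudo-random indices in [0, n)
   so that the loads in gather_sum jump around in memory. */
void fill_indices(uint32_t *idx, size_t n, uint32_t seed) {
    assert(n <= UINT32_MAX);
    uint32_t x = seed;
    for (size_t i = 0; i < n; i++) {
        x = x * 1664525u + 1013904223u;               /* simple LCG step */
        idx[i] = (uint32_t)(((uint64_t)x * n) >> 32); /* scale into [0, n) */
    }
}

/* Hypothetical kernel: load data[idx[i]] and use the value right away.
   With a large data array the load is likely to miss the cache. */
uint64_t gather_sum(const uint32_t *data, const uint32_t *idx, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t v = data[idx[i]]; /* the load is issued here */
        sum += v;                  /* first use of the loaded value */
    }
    return sum;
}
```

Calling gather_sum repeatedly from main on arrays much larger than the last-level cache, then running perf record and perf annotate on the binary, should show how the cost is distributed between the load and the instructions after it.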