I ran the following Linux perf commands:
perf record -e mem_load_retired.l1_hit:P -c 10000 -a -- ./Program_to_Test.exe
perf report > mem_load_retired.l1_hit.txt
perf annotate > mem_load_retired.l1_hit_ann.txt
The annotate file attributes 100% of the mem_load_retired.l1_hit samples to line 231, and then another 100% (66.67% + 33.33%) to lines 257-258:
mem_load_retired.l1_hit 231 100.00 vcvttpd2qq %zmm1,%zmm0{%k7}
mem_load_retired.l1_hit 257 66.67 vmovapd %zmm2,(%r15,%r14,1)
mem_load_retired.l1_hit 258 33.33 add %r9,%r14
Perf further shows that 100% of the mem_load_retired.l1_miss instances occurred at line 257, and none at 231.
My question is: how can 100% of the L1 cache hits be attributed to each of two parts of the code that are 26 lines apart?
UPDATE: Following a comment below by Peter Cordes, I removed all unneeded line labels, and the distribution of hits is now different:
mem_load_retired.l1_hit 230 16.67 vmulpd %zmm28,%zmm0,%zmm1
mem_load_retired.l1_hit 231 16.67 vcvttpd2qq %zmm1,%zmm0{%k7}
mem_load_retired.l1_hit 232 16.67 vcvtuqq2pd %zmm0,%zmm2{%k7}
mem_load_retired.l1_hit 257 33.33 vmovapd %zmm2,(%r15,%r14,1)
mem_load_retired.l1_hit 258 16.67 add %r9,%r14
There is still one necessary loop label between these two sections, but perf apparently does not treat it as the start of a new function, seemingly because it is only the target of a jump back to the top of the loop. The percentages above now add up to 100%, so nothing is counted twice as it was before. The same effect would presumably apply to C and C++ code as well, if its labels end up as symbols in the object file.
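One way to read the two listings is that perf annotate reports each line's share of the samples within its own symbol, so every symbol's lines sum to 100% separately. The sketch below is only a toy model of that normalization, not perf's actual code. The symbol names (`sym_a`, `sym_b`, `sym`) and the sample counts are hypothetical, chosen as the smallest counts that reproduce the percentages shown above.

```python
from collections import defaultdict


def annotate_percentages(samples):
    """Return {(symbol, line): percent}, normalized per symbol.

    samples is a list of (symbol, line, count) tuples, one entry per line.
    """
    totals = defaultdict(int)
    for symbol, _line, count in samples:
        totals[symbol] += count
    return {
        (symbol, line): round(100.0 * count / totals[symbol], 2)
        for symbol, line, count in samples
    }


# Hypothetical counts: an extra label splits the code into two symbols.
before = [("sym_a", 231, 1), ("sym_b", 257, 2), ("sym_b", 258, 1)]

# Hypothetical counts: with the label removed, everything is one symbol.
after = [
    ("sym", 230, 1),
    ("sym", 231, 1),
    ("sym", 232, 1),
    ("sym", 257, 2),
    ("sym", 258, 1),
]

pct_before = annotate_percentages(before)
pct_after = annotate_percentages(after)

for key, pct in pct_before.items():
    print("before", key, pct)
for key, pct in pct_after.items():
    print("after", key, pct)
```

In this model, the first run shows 100% at line 231 because that line holds the only sample in its symbol, while lines 257-258 split a second symbol's samples 66.67/33.33. Once the code is a single symbol, the same kind of counts yield 16.67 and 33.33, which sum to 100% (up to rounding) only once.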