
I'm trying to apply some performance engineering techniques to an implementation of Dijkstra's algorithm. In an attempt to find bottlenecks in the (naive and unoptimised) program, I'm using the perf command to record the number of cache misses. The relevant snippet of code, which finds the unvisited node with the smallest distance, is the following:

int tmp = -1; /* index of the closest unvisited node found so far */
for (int i = 0; i < count; i++) {
    if (!visited[i]) {
        if (tmp == -1 || dist[i] < dist[tmp]) {
            tmp = i;
        }
    }
}

For the LLC-load-misses metric, perf report shows the following annotation of the assembly:

       │             for (int i = 0; i < count; i++) {
  1.19 │ ff:   add    $0x1,%eax
  0.03 │102:   cmp    0x20(%rsp),%eax
       │     ↓ jge    135
       │                 if (!visited[i]) {
  0.07 │       movslq %eax,%rdx
       │       mov    0x18(%rsp),%rdi
  0.70 │       cmpb   $0x0,(%rdi,%rdx,1)
  0.53 │     ↑ jne    ff
       │                     if (tmp == -1 || dist[i] < dist[tmp]) {
  0.07 │       cmp    $0xffffffff,%r13d
       │     ↑ je     fc
  0.96 │       mov    0x40(%rsp),%rcx
  0.08 │       movslq %r13d,%rsi
       │       movsd  (%rcx,%rsi,8),%xmm0
  0.13 │       ucomis (%rcx,%rdx,8),%xmm0
 57.99 │     ↑ jbe    ff
       │                         tmp = i;
       │       mov    %eax,%r13d
       │     ↑ jmp    ff
       │                     }
       │                 }
       │             }

My question then is the following: why does the jbe instruction produce so many cache misses? This instruction should not have to retrieve anything from memory at all, if I am not mistaken. I figured it might have something to do with instruction cache misses, but even measuring only L1 data cache misses using L1-dcache-load-misses shows that a lot of cache misses are attributed to that instruction.

This stumps me somewhat. Could anyone explain this (in my eyes) odd result? Thank you in advance.

StephenSwat
  • Stephen, what is your exact CPU model (to find the microarchitecture name and the set of available PMU events)? LLC-load-misses is a synthetic perf event which maps to different hardware PMU events (also check `perf record ... -vvv ./program`); some of them are exact and others are not. For **inexact events, a wrong instruction address will be reported**, sometimes with a small skew, sometimes with a large one (even pointing into a different function). There are tens of exact "PEBS" events on Intel - try `dmesg | grep -i pebs`. Try the `ocperf` tool from Intel's https://github.com/andikleen/pmu-tools to use real HW events – osgx May 05 '17 at 00:28
  • Thank you for your response, osgx. The CPU I'm running this on is an Intel Core i5 750 (Nehalem). I was not aware that `LLC-load-misses` was an inexact event and had assumed it would be exact, since `perf list` lists it as a hardware cache event. If I understand correctly, I need to use the `rXXXXXXX` raw hardware events; however, I don't see how I can infer the exact event name from the Nehalem performance analysis guide I have open here: https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf – StephenSwat May 05 '17 at 00:46
  • Stephen, not all hardware events are exact. Use ocperf to get Intel event names without manual conversion into rXXXX raw codes. – osgx May 05 '17 at 01:01
  • Stephen, and what is your Linux kernel version (or Linux distribution version)? Try the `cachegrind` profiler/cache-simulator tool of the `valgrind` program and the `kcachegrind` GUI to get a basic idea of how the code works (exact instruction execution counts for every instruction), where the hot paths are, and where the cache-intensive code is. But note that it is just a simulator, its results do not equal a real CPU run with real cache behaviour, and it is very slow - expect 20-30 times slower runs under valgrind compared to native. – osgx May 05 '17 at 01:14
  • PS: there are two instructions near the high counter: "ucomis (%rcx,%rdx,8),%xmm0 + ↑ jbe", where ucomis is [some kind of compare](http://www.felixcloutier.com/x86/UCOMISS.html) with a memory argument. Intel CPUs can fuse some combinations of operations "into a single uop, CMP+JCC" (http://www.realworldtech.com/nehalem/5/), and cmp + conditional jump is a common instruction pair to be fused (you can check this with the [Intel IACA simulation tool ver 2.1](https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)). A fused pair is commonly reported in perf at one IP for the two instructions. – osgx May 05 '17 at 02:29
  • PPS: this probably means that the expression "`dist[i] < dist[tmp]`" generates two memory accesses, and both values are used in the `ucomis (%rcx,%rdx,8),%xmm0` instruction, which is (partially?) fused with the `jbe` conditional jump. Either `dist[i]` or `dist[tmp]` or both generate a high number of misses. (You may try merging the `visited[N]` and `dist[N]` arrays into an array[N] of `struct { char visited; double dist; }`, or try changing the order of vertex access, or do some prefetch for the next one or more elements (?)) – osgx May 05 '17 at 02:31

2 Answers


About your example:

There are several instructions before and at the high counter:

        │       movsd  (%rcx,%rsi,8),%xmm0
   0.13 │       ucomis (%rcx,%rdx,8),%xmm0
  57.99 │     ↑ jbe    ff

"movsd" loads word from (%rcx,%rsi,8) (some array access) into xmm0 register, and "ucomis" loads another word from (%rcx,%rdx,8) and compares it with just loaded value in xmm0 register. "jbe" is conditional jump which depends on compare outcome.

Many modern Intel CPUs (and probably AMD too) can and will fuse (combine) some combinations of operations "into a single uop, CMP+JCC" (realworldtech.com/nehalem/5), and cmp + conditional jump is a very common instruction combination to be fused (you can check this with the Intel IACA simulation tool; use ver 2.1 for your CPU). A fused pair may be reported incorrectly by perf/PMUs/PEBS, with a skew of most events towards one of the two instructions.

This probably means that the expression "dist[i] < dist[tmp]" generates two memory accesses, and both values are used by the ucomis instruction, which is (partially?) fused with the jbe conditional jump. Either dist[i] or dist[tmp] or both generate a high number of misses. Any such miss will block ucomis from producing its result and block jbe from supplying the next instruction to execute (or from retiring the predicted instructions). So jbe may get all the fame of the high counters instead of the real memory-access instructions (and for a "far" event like a cache response there is some skew towards the last blocked instruction).

You may try to merge the visited[N] and dist[N] arrays into an array[N] of struct { char visited; double dist; } (field types inferred from the cmpb and movsd instructions in the annotation) to force prefetching of array[i].dist when you access array[i].visited, or you may try to change the order of vertex access, renumber the graph vertices, or do some software prefetch for the next one or more elements (?)
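
For example, a minimal sketch of that merged layout might look like this (field types inferred from the assembly above; nodes and count stand in for your own variables):

/* One record per vertex: reading .visited pulls in the cache line
   that also holds .dist, so the later 8-byte load usually hits. */
struct node {
    char   visited; /* was visited[i]; tested by cmpb      */
    double dist;    /* was dist[i]; loaded by movsd/ucomis */
};

int tmp = -1;
for (int i = 0; i < count; i++) {
    if (!nodes[i].visited) {
        if (tmp == -1 || nodes[i].dist < nodes[tmp].dist) {
            tmp = i;
        }
    }
}

Note that padding makes each struct node 16 bytes, so a 64-byte cache line holds four vertices: you trade some density of dist values for the guarantee that a vertex's visited flag and distance share a cache line.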


About the problems of generic perf events (selected by name) and possible uncore skew.

The perf (perf_events) tool in Linux uses a predefined set of events when called as perf list, and some of the listed hardware events may not be implemented; others are mapped onto the current CPU's capabilities (and some mappings are not fully correct). Some basic info about the real PMU is in your https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf (but it has more details for the related Nehalem-EP variant).

For your Nehalem (Intel Core i5 750 with 8 MB of L3 cache and without multi-CPU/multi-socket/NUMA support), perf will map the standard ("generic cache events") LLC-load-misses event to "OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS", as written in the best (and only) documentation of perf event mappings - the kernel source code:

http://elixir.free-electrons.com/linux/v4.8/source/arch/x86/events/intel/core.c#L1103

 u64 nehalem_hw_cache_event_ids ...
[ C(LL  ) ] = {
    [ C(OP_READ) ] = {
        /* OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */
        [ C(RESULT_ACCESS) ] = 0x01b7,
        /* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */
        [ C(RESULT_MISS)   ] = 0x01b7,
...
/*
 * Nehalem/Westmere MSR_OFFCORE_RESPONSE bits;
 * See IA32 SDM Vol 3B 30.6.1.3
 */
#define NHM_DMND_DATA_RD    (1 << 0)
#define NHM_DMND_READ       (NHM_DMND_DATA_RD)
#define NHM_L3_MISS (NHM_NON_DRAM|NHM_LOCAL_DRAM|NHM_REMOTE_DRAM|NHM_REMOTE_CACHE_FWD)
...
 u64 nehalem_hw_cache_extra_regs
  ..
 [ C(LL  ) ] = {
    [ C(OP_READ) ] = {
        [ C(RESULT_ACCESS) ] = NHM_DMND_READ|NHM_L3_ACCESS,
        [ C(RESULT_MISS)   ] = NHM_DMND_READ|NHM_L3_MISS,

I think this event is not precise: the CPU pipeline will post the load request to the cache hierarchy (out of order) and will keep executing other instructions. After some time (around 10 cycles to reach L2 and get its response, and 40 cycles to reach L3) the response comes back with the miss flag set, and the corresponding (offcore?) PMU counter is incremented. On counter overflow, a profiling interrupt is generated by this PMU. It takes several CPU clock cycles for the interrupt to reach and stop the pipeline; the perf_events subsystem's handler then registers the current (interrupted) EIP/RIP instruction pointer and resets the PMU counter back to some negative value (for example, -100000 to get an interrupt for every 100000 L3 misses counted; use perf record -e LLC-load-misses -c 100000 to set the exact count, or perf will auto-tune the limit to get some default frequency). The registered EIP/RIP is not the IP of the load instruction, and it may also not be the EIP/RIP of the instruction which wants to use the loaded data.
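
Schematically, this period-based sampling works roughly as follows (an illustrative C sketch of the mechanism only - the real counter lives in the PMU hardware, and record_sample/current_rip are hypothetical names):

extern void record_sample(unsigned long rip); /* hypothetical */
extern unsigned long current_rip(void);       /* hypothetical */

long long counter = -100000;   /* as set by: perf record -c 100000 */

void on_llc_miss(void)         /* conceptually run by hardware per miss */
{
    if (++counter >= 0) {
        /* Overflow: the PMU raises the profiling interrupt; the handler
           records the *interrupted* RIP, which may lag behind the load
           that actually missed. */
        record_sample(current_rip());
        counter = -100000;     /* re-arm for the next period */
    }
}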

But if your CPU is the only socket in the system and you access normal memory (not some memory-mapped PCI-express space), an L3 miss will in fact be served as a local memory access, and there are some counters for this... (https://software.intel.com/en-us/node/596851 - "Any memory requests missing here must be serviced by local or remote DRAM").

There are some listings of PMU events for your CPU on the web, and there should be some information about the ANY_LLC_MISS offcore PMU event implementation and a list of PEBS events for Nehalem, but I can't find them now.

I can recommend that you use ocperf from https://github.com/andikleen/pmu-tools with any PMU event of your CPU, without the need to encode them manually. There are some PEBS events in your CPU, and there is latency profiling / perf mem for some kinds of memory-access profiling (some random perf mem PDFs: the 2012 post "perf: add memory access sampling support", RH 2013 - pg 26-30, still not documented in 2015 - sowa pg 19; ls /sys/devices/cpu/events). For newer CPUs there are newer tools like ucevent.
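
For example, an illustrative invocation might look like `ocperf.py stat -e offcore_response.any_data.any_llc_miss ./program` - the exact event name to use depends on what `ocperf.py list` reports for your CPU, so treat this spelling as an assumption rather than a verified event name.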

I can also recommend that you try the cachegrind profiler/cache-simulator tool of the valgrind program, with the kcachegrind GUI to view the profiles. Valgrind-based profilers may help you get a basic idea of how the code works: they collect exact instruction execution counts for every instruction, and cachegrind additionally simulates an abstract multi-level cache. But a real CPU executes several instructions per cycle (so the callgrind/cachegrind cost model of 1 instruction = 1 CPU clock cycle introduces some error), and the cachegrind cache model does not have the same logic as a real cache. Moreover, all valgrind tools are dynamic binary-instrumentation tools, which will slow your program down 20-30 times compared to a native run.
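
A typical session (standard cachegrind usage; the output file name includes the process ID): run `valgrind --tool=cachegrind ./program`, then inspect the resulting `cachegrind.out.<pid>` with `cg_annotate` or open it in kcachegrind.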

osgx
  • Thank you for your extensive explanation. You've surely helped me a great deal in understanding how these events work. – StephenSwat May 05 '17 at 10:27
  • Thank you for this great answer, you are really an expert on perf and low-level programming. Could you please have a look at my similar question here: https://stackoverflow.com/questions/63990981/why-does-cmp-instruction-cost-too-much-time – prehistoricpenguin Sep 21 '20 at 11:05
  • FP compares don't fuse with branches, only some integer ops. – Peter Cordes Jun 05 '23 at 06:12

When you read a memory location, the processor will try to prefetch the adjacent memory locations and cache them.

That works well if you are reading an array of objects which are all allocated in contiguous blocks of memory.

However, if you have, for example, an array of pointers to objects that live on the heap, it is less likely that you will be iterating over contiguous portions of memory, unless you are using some sort of custom allocator specifically designed for this.

Because of this, dereferencing should be seen as having a cost. An array of structs can be more efficient than an array of pointers to structs.
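
A minimal sketch of the two layouts (the struct fields and names here are illustrative):

#include <stdlib.h>

struct node { double dist; char visited; };

void allocate_layouts(int count)
{
    /* Array of structs: elements sit in one contiguous block, so a
       linear scan walks consecutive cache lines and the hardware
       prefetcher can stay ahead of the loop. */
    struct node *aos = malloc(count * sizeof *aos);

    /* Array of pointers: every element is a separate heap allocation,
       so a scan hops around memory and each dereference risks a cache
       miss that the prefetcher cannot predict. */
    struct node **aop = malloc(count * sizeof *aop);
    for (int i = 0; i < count; i++)
        aop[i] = malloc(sizeof *aop[i]);

    /* ... use the arrays; freed here for completeness ... */
    for (int i = 0; i < count; i++)
        free(aop[i]);
    free(aop);
    free(aos);
}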

Herb Sutter (a member of the C++ committee) talks about this in this presentation: https://youtu.be/TJHgp1ugKGM?t=21m31s

arboreal84