
I'm fighting memory latency using memory prefetching. Some (older) Intel CPUs, e.g. the E5-2690, support a performance counter that counts the cycles the CPU wasted waiting for memory (stalled-cycles-backend).

On newer CPUs (Gold 6230 and Gold 6226, for example) I cannot find this counter. Is there another way to count the cycles a CPU wasted waiting for the memory controller to load cache lines?

Peter Cordes
jagemue
  • Skylake's `resource_stalls.any` counter might be what you're looking for. Not sure if that's exactly equivalent to `stalled-cycles-backend` on Sandybridge. – Peter Cordes Oct 31 '19 at 10:26
  • Oh, if you want memory stalls specifically, there are much more specific events; search through `perf list` output for what you're looking for. e.g. from my SKL (Skylake-client) `mem_load_retired.l3_miss` counts load insns specifically (not cycles). Or perhaps `cycle_activity.stalls_l3_miss` counts *Execution stalls while L3 cache miss demand load is outstanding*. That's not the same as cycles with no uops delivered, just none executed, so I assume it can count even when the ROB / RS isn't full. – Peter Cordes Oct 31 '19 at 10:31
  • Thanks Peter, I will give `cycle_activity.stalls_l3_miss` a try. – jagemue Oct 31 '19 at 10:37

2 Answers


The event that perf calls "stalled-cycles-backend" is a "generic" event that is implemented differently on different processor models. The definitions are a pain to find, but in the CentOS 7.6 kernel source, the definitions are in "arch/x86/events/intel/core.c". For Sandy Bridge (Xeon E5-26xx), the definition is Event 0xB1, Umask 0x01, INV=1, CMASK=1. Looking up this event in Chapter 19 of Volume 3 of the Intel Architectures SW Developer's Manual (document 325384-071, October 2019), Table 19-3 says that on Skylake Xeon (and Cascade Lake Xeon), this event means the same thing: "Counts cycles during which no uops were dispatched from the Reservation Station (RS) per thread."

I recommend against using these "generic" events if you want to understand what is being counted. It is a pain to go hunting in the kernel source for the definitions, or to build a test program to read the actual MSRs that perf programs. The first one that I tested today is actually wrong -- on a Xeon E5 v4 system, the event "uops_executed.core_cycles_none" is programmed as Event 0xb1, Umask 0x02, INV=1, but the CMASK is not set to 1. Section 18.2 of Volume 3 of the SWDM says that INV is ignored if CMASK is zero, so this actually counts total Uops executed, not cycles with no Uops executed. (The same event is programmed correctly on an SKX box running exactly the same kernel.)
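To make the field layout concrete, here is a minimal Python sketch that packs an event number, umask, INV, and CMASK into the IA32_PERFEVTSELx bit layout described in the SDM. The helper names `encode_evtsel` and `perf_raw` are illustrative only and are not part of perf or any library. The USR, OS, and EN bits are set by default on the assumption that you want raw codes in the same style as the ones passed to `perf stat` in this answer.

```python
def encode_evtsel(event, umask, *, inv=False, cmask=0, edge=False,
                  user=True, kernel=True, enable=True):
    """Pack fields into the IA32_PERFEVTSELx layout (hypothetical helper)."""
    for name, field in (("event", event), ("umask", umask), ("cmask", cmask)):
        if not 0 <= field <= 0xFF:
            raise ValueError(f"{name} must fit in 8 bits")
    value = event                    # bits 0-7:   event select
    value |= umask << 8              # bits 8-15:  unit mask
    value |= int(user) << 16         # bit 16:     USR, count in user mode
    value |= int(kernel) << 17       # bit 17:     OS, count in kernel mode
    value |= int(edge) << 18         # bit 18:     edge detect
    value |= int(enable) << 22       # bit 22:     EN, enable the counter
    value |= int(inv) << 23          # bit 23:     INV, invert the CMASK comparison
    value |= cmask << 24             # bits 24-31: counter mask (threshold)
    return value


def perf_raw(value):
    """Format an encoded value as a perf raw event name, e.g. 'r01c301b1'."""
    return f"r{value:08x}"


if __name__ == "__main__":
    # Unhalted core cycles: event 0x3C, umask 0x00.
    print(perf_raw(encode_evtsel(0x3C, 0x00)))
    # Cycles with no uops dispatched: event 0xB1, umask 0x01, INV=1, CMASK=1.
    print(perf_raw(encode_evtsel(0xB1, 0x01, inv=True, cmask=1)))
    # Cycles with at least one uop dispatched: same event, CMASK=1, INV=0.
    print(perf_raw(encode_evtsel(0xB1, 0x01, cmask=1)))
```

Note that passing `inv=True` with `cmask=0` produces a value in which the INV bit is set but, per the SDM text cited above, has no effect.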

An example that counts total cycles, cycles with no Uops dispatched, and cycles with at least one Uop dispatched while running the Intel Memory Latency Checker:

perf stat -e r0043003c -e r01c301b1 -e r014301b1 ./mlc --idle_latency
  Intel(R) Memory Latency Checker - v3.7
  Command line parameters: --idle_latency 

  Using buffer size of 2000.000MiB
  *** Unable to modify prefetchers (try executing 'modprobe msr')
  *** So, enabling random access for latency measurements
  Each iteration took 182.4 core clocks (   87.1    ns)


 Performance counter stats for './mlc --idle_latency':

    91,815,806,587      r0043003c                                                   
    64,132,006,584      r01c301b1                                                   
    27,683,941,060      r014301b1                                                   

      14.587156882 seconds time elapsed
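As a rough sanity check on these numbers, the short Python sketch below computes the share of cycles with no Uops dispatched and verifies that the two dispatch counters approximately add up to the total cycle count. The variable names are illustrative; the three values are copied from the perf output above.

```python
# Counter values copied from the perf stat output above.
total_cycles = 91_815_806_587      # r0043003c: unhalted core cycles
no_dispatch = 64_132_006_584       # r01c301b1: cycles with no uops dispatched
some_dispatch = 27_683_941_060     # r014301b1: cycles with >= 1 uop dispatched

# Fraction of all cycles in which the RS dispatched nothing.
stall_share = no_dispatch / total_cycles

# The two dispatch counters should partition the cycle count, up to small
# skew from the counters not being started and stopped at the same instant.
relative_mismatch = abs(total_cycles - (no_dispatch + some_dispatch)) / total_cycles

if __name__ == "__main__":
    print(f"no-dispatch share of cycles: {stall_share:.1%}")
    print(f"relative mismatch of the two halves: {relative_mismatch:.2e}")
```

Keep in mind that this is the share of cycles without any dispatch, which is an upper-level indicator for a latency-bound test like this one, not a direct count of cycles spent waiting on the memory controller.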
John D McCalpin
  • The absence of `cmask` for the event `uops_executed.core_cycles_none` is probably a bug, which seems to exist on Broadwell and earlier. – Hadi Brais Dec 03 '19 at 20:30

stalled-cycles-frontend is supported only on Nehalem, Westmere, Sandy Bridge, and Ivy Bridge. It's mapped to event 0x0e, umask=0x01, inv=1, cmask=1 on all of these microarchitectures. stalled-cycles-backend is supported on Nehalem, Westmere, and Sandy Bridge. On the first two, it's mapped to event=0xb1, umask=0x3f, inv=1, cmask=1. On SnB, it's mapped to event=0xb1, umask=0x01, inv=1, cmask=1.
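If you want to request these encodings explicitly instead of relying on the generic event names, perf accepts a `pmu/term=value,.../` event syntax. The following Python sketch just tabulates the mappings listed above and formats them that way. The table name `STALL_EVENTS`, the function `perf_event_string`, and the default PMU name `cpu` are assumptions for illustration; check the PMU names under `/sys/bus/event_source/devices/` on your machine.

```python
# Encodings as listed in this answer, keyed by (generic event, microarchitecture).
FRONTEND = dict(event=0x0E, umask=0x01, inv=1, cmask=1)
STALL_EVENTS = {
    ("stalled-cycles-frontend", "Nehalem"): FRONTEND,
    ("stalled-cycles-frontend", "Westmere"): FRONTEND,
    ("stalled-cycles-frontend", "Sandy Bridge"): FRONTEND,
    ("stalled-cycles-frontend", "Ivy Bridge"): FRONTEND,
    ("stalled-cycles-backend", "Nehalem"): dict(event=0xB1, umask=0x3F, inv=1, cmask=1),
    ("stalled-cycles-backend", "Westmere"): dict(event=0xB1, umask=0x3F, inv=1, cmask=1),
    ("stalled-cycles-backend", "Sandy Bridge"): dict(event=0xB1, umask=0x01, inv=1, cmask=1),
}


def perf_event_string(enc, pmu="cpu"):
    """Format an encoding in perf's pmu/term=value/ syntax (hypothetical helper)."""
    return (f"{pmu}/event={enc['event']:#04x},umask={enc['umask']:#04x},"
            f"inv={enc['inv']},cmask={enc['cmask']}/")


if __name__ == "__main__":
    for (generic, uarch), enc in STALL_EVENTS.items():
        print(f"{generic:24s} {uarch:13s} {perf_event_string(enc)}")
```

A string produced this way would be passed as `perf stat -e '<string>' ./your_program`. Whether the event counts anything meaningful on a given CPU is a separate question, as the rest of this answer discusses.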

Starting with kernel v4.6-rc1, if either of these events is not supported on the current processor, it's not shown in the output of `perf stat`. In earlier versions of the kernel, it shows up as `<not supported>`.

Andi Kleen (Intel) said in this thread that event=0xb1, umask=0x01, inv=1, cmask=1 is not (officially) supported on Ivy Bridge and that the table from the manual that lists the event is outdated. That's why stalled-cycles-backend is not supported on IvB. But according to Table 19-15 of the manual V3 (May 2019), it's still listed for IvB. It's also listed for Broadwell and later, but not Haswell. However, the Performance Monitoring Events Manual does list it for Haswell. Perhaps it's buggy on Haswell? I don't know.

According to another thread, these two events appear to have been fully abandoned starting with Haswell in favor of the first level cycles breakdown of the top-down methodology.

Hadi Brais
  • I have a Kaby Lake, and a lot of counters from the Broadwell table in Vol 3B that don't appear in the Kaby Lake table still work (when I benchmark an app using a driver where I directly program the PMC EVTSELs using wrmsr and then rdpmc - rdpmc). They show different but feasible values compared to the supported counters with the same event but a different Umask, for instance UOPS_ISSUED.ANY and FLAGS_MERGE. The values consistently differ by substantially more than the number of uops between the rdpmc instructions (4*2) for the 2 different PMCs benchmarking the supported and unsupported counter in the same run. – Lewis Kelsey Apr 17 '21 at 05:20
  • I meant UOPS_RETIRED.ALL not UOPS_ISSUED.ANY. Take a look https://imgur.com/sicAwoT – Lewis Kelsey Apr 17 '21 at 12:52