
In the question "cpu cache performance. store misses vs load misses", there is no answer about where to find documentation for the events listed by `perf list`.

I can't find it in `man perf` or `perf help list`.

I have read the event documentation for Intel® 64 and AMD64, where events are described in a format like this: "Last Level Cache References: Event select 2EH, Umask 4FH". So where is the equivalent documentation for perf's events?

Edit: To be clear, I am looking for documentation of the events listed by `perf list`.

Ryan Chen
  • [Use `ocperf.py`](https://github.com/andikleen/pmu-tools) for symbolic names for CPU-specific events like `l2_lines_in.all` or `uops_issued.any`. (And `ocperf.py list` shows them with brief documentation on what they mean). – Peter Cordes Feb 28 '18 at 10:47
  • That is not the official document, and is not complete. – Ryan Chen Feb 28 '18 at 11:07
  • I know it's not official, but it's extremely useful in practice. That's why I posted a comment, not an answer. Which events is it missing? I thought it was more or less complete, except maybe for some stuff you can do with masks / options for some events. – Peter Cordes Feb 28 '18 at 11:11
  • Sorry for being rude, and thank you for your comment. When I use `ocperf.py list`, it shows documentation for the implementation-dependent events, starting from `arith.divider_uops`. But I want the documentation for perf's own generic events, from `branch-instructions OR branches` to `mem`. – Ryan Chen Feb 28 '18 at 11:18
  • When you run `ocperf.py stat -e whatever ./a.out`, it prints the `perf` command before running it, like `perf stat -e cpu/event=0xb1,umask=0x1,name=uops_executed_thread/ ./a.out`. (Numeric codes only for events that `perf` doesn't know by name, though. There may be a verbose option for `perf` or `ocperf.py`.) I think that's close to what you're asking about. – Peter Cordes Feb 28 '18 at 11:46
  • I see, let me do the experiment. – Ryan Chen Feb 28 '18 at 11:51
  • No luck for me. When I run `ocperf.py stat -e whatever ./a.out` and `whatever` is one of perf's original events like "branches" or "L1-dcache-loads", it just prints `perf stat -e whatever ./a.out`. It only behaves as you describe for events starting from `arith.divider_uops`. – Ryan Chen Feb 28 '18 at 11:56
  • Are you trying to find the list of predefined events supported by your version of perf? If so, just run `perf list`, which shows the events you can use for measurement. If you cannot use some of them, try specifying events by their symbolic names (that is where `ocperf.py` helps, as Peter suggested). – Arnabjyoti Kalita Mar 01 '18 at 19:01


The list of predefined perf events such as `branches`, `cycles` and `LLC-load-misses` is documented by the source code of the perf subsystem inside the Linux kernel. These names are mapped, only partially, to various hardware events that differ between CPU models and microarchitectures. It can be more useful to use `ocperf.py` (and `toplev.py`) from andikleen's pmu-tools (if your CPU is Intel) with event names from Intel's documentation. ocperf is not official, but it is written by an Intel employee and uses the official event lists from https://download.01.org/perfmon/ (see https://download.01.org/perfmon/readme.txt: "This package contains performance monitoring event lists for Intel processors").

For x86 and x86_64, perf's (ancient) predefined/generic names are mapped in the arch/x86/events directory. For example, for all Intel Core microarchitectures check arch/x86/events/intel/core.c and search for the microarchitecture by its code name (Core, Core2, NHM=Nehalem, WSM=Westmere, SNB=SandyBridge, IVB=IvyBridge, HSW=HaSWell, BDW=BroaDWell, SKL=SKyLake, SLM=SiLverMont, and others from the lists; AMD CPUs are handled in arch/x86/events/amd). For Skylake there is a structure at line 394 of intel/core.c in Linux 4.15.8, and we see that the PREFETCH counters are not mapped for all caches ("not supported"):

```c
static __initconst const u64 skl_hw_cache_event_ids
                [PERF_COUNT_HW_CACHE_MAX]
                [PERF_COUNT_HW_CACHE_OP_MAX]
                [PERF_COUNT_HW_CACHE_RESULT_MAX] =
{
 [ C(L1D ) ] = {
    [ C(OP_READ) ] = {
        [ C(RESULT_ACCESS) ] = 0x81d0,  /* MEM_INST_RETIRED.ALL_LOADS */
        [ C(RESULT_MISS)   ] = 0x151,   /* L1D.REPLACEMENT */
    },
    [ C(OP_WRITE) ] = {
        [ C(RESULT_ACCESS) ] = 0x82d0,  /* MEM_INST_RETIRED.ALL_STORES */
        [ C(RESULT_MISS)   ] = 0x0,
    },
    /* ... */
 },
 /* ... */
 [ C(LL  ) ] = {
    [ C(OP_READ) ] = {
        [ C(RESULT_ACCESS) ] = 0x1b7,   /* OFFCORE_RESPONSE */
        [ C(RESULT_MISS)   ] = 0x1b7,   /* OFFCORE_RESPONSE */
    },
    [ C(OP_WRITE) ] = {
        [ C(RESULT_ACCESS) ] = 0x1b7,   /* OFFCORE_RESPONSE */
        [ C(RESULT_MISS)   ] = 0x1b7,   /* OFFCORE_RESPONSE */
    },
    /* ... */
 },
 /* ... */
};
```

There is also an extra structure that defines additional flags/masks for events like OFFCORE_RESPONSE:

```c
static __initconst const u64 skl_hw_cache_extra_regs
                [PERF_COUNT_HW_CACHE_MAX]
                [PERF_COUNT_HW_CACHE_OP_MAX]
                [PERF_COUNT_HW_CACHE_RESULT_MAX] =
{
 [ C(LL  ) ] = {
    [ C(OP_READ) ] = {
        [ C(RESULT_ACCESS) ] = SKL_DEMAND_READ|
                               SKL_LLC_ACCESS|SKL_ANY_SNOOP,
        [ C(RESULT_MISS)   ] = SKL_DEMAND_READ|
                               SKL_L3_MISS|SKL_ANY_SNOOP|
                               SKL_SUPPLIER_NONE,
    },
    [ C(OP_WRITE) ] = {
        [ C(RESULT_ACCESS) ] = SKL_DEMAND_WRITE|
                               SKL_LLC_ACCESS|SKL_ANY_SNOOP,
        [ C(RESULT_MISS)   ] = SKL_DEMAND_WRITE|
                               SKL_L3_MISS|SKL_ANY_SNOOP|
                               SKL_SUPPLIER_NONE,
    },
    /* ... */
 },
 [ C(NODE) ] = {
    [ C(OP_READ) ] = {
        [ C(RESULT_ACCESS) ] = SKL_DEMAND_READ|
                               SKL_L3_MISS_LOCAL_DRAM|SKL_SNOOP_DRAM,
        [ C(RESULT_MISS)   ] = SKL_DEMAND_READ|
                               SKL_L3_MISS_REMOTE|SKL_SNOOP_DRAM,
    },
    [ C(OP_WRITE) ] = {
        [ C(RESULT_ACCESS) ] = SKL_DEMAND_WRITE|
                               SKL_L3_MISS_LOCAL_DRAM|SKL_SNOOP_DRAM,
        [ C(RESULT_MISS)   ] = SKL_DEMAND_WRITE|
                               SKL_L3_MISS_REMOTE|SKL_SNOOP_DRAM,
    },
    /* ... */
 },
};
```
osgx