
I want to get the cache hit rate for a specific function (foo) of a C/C++ program running on a Linux machine. I am using gcc with no compiler optimization. With perf I can get hit rates for the entire program using the following command.

perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./a.out

But I am interested in the kernel foo only.

Is there a way to get hit rates only for foo using perf or any other tool?

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>


#define NI 192
#define NJ NI

#ifndef DATA_TYPE
    #define DATA_TYPE float
#endif


static 
void* xmalloc(size_t num)
{
    void * nnew = NULL;
    int ret = posix_memalign (&nnew, 32, num);
    if(!nnew || ret)
    {
        fprintf(stderr, "Can not allocate Memory\n");
        exit(1);
    }
    return nnew;
}

void* alloc_data(unsigned long long int n, int elt_size)
{
    size_t val = n;
    val *= elt_size;
    void* ret = xmalloc(val);
    return ret;
}


/* Array initialization. */
static
void init_array(int ni, int nj,
        DATA_TYPE A[NI][NJ],
        DATA_TYPE R[NJ][NJ],
        DATA_TYPE Q[NI][NJ])
{
  int i, j;

  for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
      A[i][j] = ((DATA_TYPE) i*j) / ni;
      Q[i][j] = ((DATA_TYPE) i*(j+1)) / nj;
    }
  for (i = 0; i < nj; i++)
    for (j = 0; j < nj; j++)
      R[i][j] = ((DATA_TYPE) i*(j+2)) / nj;
}


/* Main computational kernel.*/

static
void foo(int ni, int nj,
        DATA_TYPE A[NI][NJ],
        DATA_TYPE R[NJ][NJ],
        DATA_TYPE Q[NI][NJ])
{
  int i, j, k;

  DATA_TYPE nrm;
  for (k = 0; k < nj; k++)
  {
    nrm = 0;
    for (i = 0; i < ni; i++)
      nrm += A[i][k] * A[i][k];
    R[k][k] = sqrt(nrm);
    for (i = 0; i < ni; i++)
      Q[i][k] = A[i][k] / R[k][k];
    for (j = k + 1; j < nj; j++)
    {
      R[k][j] = 0;
      for (i = 0; i < ni; i++)
        R[k][j] += Q[i][k] * A[i][j];
      for (i = 0; i < ni; i++)
        A[i][j] = A[i][j] - Q[i][k] * R[k][j];
    }
  }
}


int main(int argc, char** argv)
{
  /* Retrieve problem size. */
  int ni = NI;
  int nj = NJ;

  /* Variable declaration/allocation. */
  DATA_TYPE (*A)[NI][NJ];
  DATA_TYPE (*R)[NI][NJ];
  DATA_TYPE (*Q)[NI][NJ];

  A = ((DATA_TYPE (*)[NI][NJ])(alloc_data((NI*NJ), (sizeof(DATA_TYPE)))));
  R = ((DATA_TYPE (*)[NI][NJ])(alloc_data((NI*NJ), (sizeof(DATA_TYPE)))));
  Q = ((DATA_TYPE (*)[NI][NJ])(alloc_data((NI*NJ), (sizeof(DATA_TYPE)))));
  
/* Initialize array(s). */
  init_array (ni, nj,
          (*A),
          (*R),
          (*Q));


  /* Run kernel. */
  foo (ni, nj, *A, *R, *Q);

  /* Be clean. */
  free((void *)A);
  free((void *)R);
  free((void *)Q);

  return 0;
}

Output of lscpu command is:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15 
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel 
CPU family:            6
Model:                 63
Model name:            Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz
Stepping:              2
CPU max MHz:           3500.0000
CPU min MHz:           1200.0000
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-15
Atanu Barai
    Write a program which only runs `foo()` and measure it? – hyde Oct 29 '20 at 07:11
  • 1
    What you want is a caliper measurement: a "start counter" before calling foo() and a "stop counter" upon the end of foo(). To make it, you will need to instrument the code and rebuild it. The ability to get the counters depends on the processor architecture and its PMU. The way to get the counters is vendor specific. That is why libraries like papi are useful, as they support multiple processor/PMU architectures transparently. Why were you not able to use papi? – Rachid K. Oct 29 '20 at 07:14
  • @hyde: That would include counts for dynamic-linking, and for the alloc / initialize part. You can count only user-space by using `perf stat --all-user` (or with older `perf`, with `event:u,event:u,...`) So yeah you could just time the whole program if you can repeat `foo` a lot of times to drown out the background noise of the init work; if it can be run multiple times without redoing its init. But that may be impractical if you want to run `foo` with a large array that includes a lot of init time. – Peter Cordes Oct 29 '20 at 07:15
  • @PeterCordes Could use static linking. Could precalculate the array. – hyde Oct 29 '20 at 07:21
  • @hyde, yes, but with large static data the first run of `foo()` will hit page faults, instead of reading recently-written memory that's probably hot in the TLBs, and very likely at least won't page-fault, so some HW prefetching and OoO exec across page boundaries will be available even if it's just looping over a big contiguous array. If it's a pointer-based data structure (like a tree), mmaping with `MAP_FIXED` is plausible but a big pain to create data for, and to use. And not viable if `foo` involves `free()` on any of the initialized data. – Peter Cordes Oct 29 '20 at 07:24
  • @Rachid K. I basically tried PAPI for this. Tried to use this formula: L1-D-cache hit rate = 1 - PAPI_L1_DCM/(PAPI_LD_INS+PAPI_SR_INS) – Atanu Barai Oct 29 '20 at 07:38
  • 1
    But this is returning me error code -8(Event exists, but cannot be counted due to counter resource limitations) when I try to add those events using **PAPI_add_event** function. It fails when I try to add three events. If I add only two events, it works fine. – Atanu Barai Oct 29 '20 at 07:51
  • @PeterCordes Sure. Depends on the exact purpose, whether that matters much or not. On a general purpose multi-user multi-tasking OS, there's going to be a lot of noise on a shared resource like CPU cache anyway. High process priority might be useful too. – hyde Oct 29 '20 at 08:16
  • @AtanuBarai: When you use `perf` from the command line, does it let you use more than 2 events (other than instructions or cycles which have "fixed" counters) without having to multiplex the counters? (extra column of % of time this counter was active). Have you disabled the NMI-watchdog that uses up one programmable counter? – Peter Cordes Oct 29 '20 at 22:42
  • @Peter Cordes No, perf doesn't. I guess there are not enough counters in the CPU for that. So one solution might be to add only two events at a time, get the counter values, and in the next run add the other counter and get its value. Disabling the NMI watchdog is a good idea. I will try it. Thanks – Atanu Barai Oct 31 '20 at 02:03
  • @AtanuBarai: What is your CPU architecture ? If it is Intel, can you edit your question and add the result of the "cpuid -1" command ? The latter is verbose, just share the header (CPU identification) and the PMU part (Architecture Performance Monitoring Features entries). – Rachid K. Oct 31 '20 at 07:11
  • I don't understand why this question has been closed: it is clear... – Rachid K. Oct 31 '20 at 07:16
  • @RachidK. I added the output of the lscpu command here. – Atanu Barai Oct 31 '20 at 21:22
  • With cpuid, you would get information about the PMU as well (especially the number of counters in the "Architecture Performance Monitoring Features" sections). – Rachid K. Oct 31 '20 at 22:22
  • According to [Intel documentation](https://www.intel.fr/content/www/fr/fr/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html): the Intel Core i7 processor family supports architectural performance monitoring and provides four general-purpose performance counters (IA32_PMC0, IA32_PMC1, IA32_PMC2, IA32_PMC3) and three fixed-function performance counters (IA32_FIXED_CTR0, IA32_FIXED_CTR1, IA32_FIXED_CTR2) in the processor core. – Rachid K. Oct 31 '20 at 22:35

3 Answers


You could also use Likwid and its Marker API. It makes it very easy to instrument selected regions of your code. You can use the predefined performance group ICACHE on the Haswell architecture for the L1 instruction cache miss rate, or define your own performance group for the L1 data cache hit rate.

#include <likwid.h>

LIKWID_MARKER_INIT;
LIKWID_MARKER_START("foo");

foo(ni, nj, *A, *R, *Q);

LIKWID_MARKER_STOP("foo");
LIKWID_MARKER_CLOSE;

run application with:

likwid-perfctr -g ICACHE -m <your application>

Make sure to compile with -DLIKWID_PERFMON, add the Likwid include and library paths, and link the Likwid library: -L$LIKWID_LIB -I$LIKWID_INCLUDE -llikwid. Everything is documented very well on their GitHub wiki.

breiters

You might be interested in gprof(1). It won't measure cache hit rate (a per-function hit rate is of questionable meaning anyway, since some calls to foo could be inlined once GCC is invoked with optimizations enabled).

You could use libbacktrace in your code. See also time(7) and signal(7).

You might compile your code with gcc -Wall -Wextra -O2 -g -pg then use libbacktrace (like GCC or RefPerSys are doing) inside it, and later gprof(1) with gdb(1).

With efforts (so read Advanced Linux Programming then syscalls(2) and signal-safety(7)) you might use setitimer(2) with sigaction(2) and/or profil(3).

Consider also generating some C code (e.g. using GPP and/or GNU bison in your own C code generator) and see this answer. J.Pitrat's book Artificial Beings: the Conscience of a Conscious Machine (ISBN-13: 978-1848211018) could be inspirational. You may want to generate some C code for extra instrumentation.

You might generate some code in a plugin (e.g. with libgccjit or GNU lightning...) at runtime, then dlopen(3) and dlsym(3) it. Read more about partial evaluation and see my manydl.c example, and more seriously the source code of Ocaml or of SBCL.

You could write your GCC plugin to automatically generate some measurements, in a more clever way than what the -pg option of GCC is doing. Your GCC plugin would transform (at the GIMPLE level) most function calls to something more complex making some benchmarking (this is how -pg works inside GCC, and you might study the source code of GCC). Try compiling your foo.c as gcc -Wall -Wextra -O2 -pg -S -fverbose-asm foo.c and look into the generated foo.s, perhaps adding more optimizations, or static analysis or instrumentation options.

You could be interested in recent papers of ACM SIGPLAN.

At last, benchmarking a C program compiled without optimizations makes no sense. Consider instead compiling and linking your program with at least gcc -flto -O2 -Wall.

Within your foo, you might cleverly use clock_gettime(2) to measure CPU time.

And if performance is very important and if you are allowed to spend weeks of work to improve it, you might consider using OpenCL (or perhaps CUDA) to compute your kernel on a powerful GPGPU. Of course, you need dedicated hardware. Otherwise, consider using OpenMP or OpenACC (or perhaps MPI). Some recent GCC compilers (at least GCC 10 in October 2020) could support these. Of course, read documentation on Invoking GCC.

Basile Starynkevitch

First, note that L1-dcache-store-misses is not supported on your processor. perf stat will tell you that in the output.

perf stat doesn't let you profile only selected regions of code. To do that, you have to manually instrument the program so that the specified events are enabled and disabled exactly around the region of interest.

It's not possible to count the events L1-dcache-loads, L1-dcache-load-misses, and L1-dcache-stores without multiplexing on your processor (Haswell). They're mapped to the native events MEM_UOPS_RETIRED.ALL_LOADS, L1D.REPLACEMENT, and MEM_UOPS_RETIRED.ALL_STORES, respectively. Each of these events can only be counted on the first four general-purpose counters. In addition, there is a bug that is not documented in the specification update document of the i7-5960X, but does exist in the i7-5960X (it's documented in the spec update documents of other Haswell processors and of processors of some other microarchitectures). This bug is handled differently in different versions of perf. Starting with kernel version 4.1-rc7, if one of the events affected by the bug is enabled on a logical core and hyperthreading is enabled at boot time, a logical core can only use up to two of its four general-purpose counters. The MEM_UOPS_RETIRED.* events are among the ones affected by the bug. One thing you can do is disable hyperthreading.

It's important to understand what kind of "cache hit rate" can be measured with these events. You probably don't want to measure something that doesn't make sense. One ratio that may make sense is L1-dcache-load-misses / (L1-dcache-loads + L1-dcache-stores), which represents the number of L1D replacements (lines filled into the cache that cause others to be evicted) for any reason, divided by the number of retired load and store uops. Not all misses cause replacements, and a significant portion of all misses may hit in the LFBs, which also don't cause replacements. Also, not all replacements are caused by accesses from uops that end up retiring.

Hadi Brais