
I'm trying to profile an application that has both userspace and kernelspace code using perf. I have tried enabling various kernel configuration options, but I cannot get the instruction/cycle counts for userspace or kernelspace alone. I tried adding the ":u" and ":k" modifiers to the instructions and cycles events, but all I get in reply is:

$ perf stat -e cycles:u,instructions:u ls

 Performance counter stats for 'ls':

   <not supported>      cycles:u

   <not supported>      instructions:u

       0.006047045 seconds time elapsed

       0.000000000 seconds user
       0.008098000 seconds sys

However, running with plain cycles/instructions gives a proper result, something like the following:

$ perf stat -e cycles,instructions ls

 Performance counter stats for 'ls':

          5362086      cycles
            528783      instructions              #    0.10  insn per cycle

       0.005487940 seconds time elapsed

       0.007800000 seconds user
       0.000000000 seconds sys

Note: ls is just used as an example here to highlight the issue.

I'm running Linux 5.4 with perf version 5.4.77.g1206eede9156, and I'm running the above commands on an ARM board. Below are the configuration options that I've enabled in the Linux kernel:

CONFIG_PERF_EVENTS=y
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_KPROBES=y
CONFIG_OPTPROBES=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_UPROBES=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_DWARF4=y
CONFIG_FRAME_POINTER=y
CONFIG_FTRACE=y
CONFIG_KPROBE_EVENTS=y
CONFIG_UPROBE_EVENTS=y
CONFIG_PROBE_EVENTS=y

Further, perf list on the command line shows hardware and software events, among many others:

$ perf list
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]
  duration_time                                      [Tool event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  L1-icache-prefetch-misses                          [Hardware cache event]
  L1-icache-prefetches                               [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]

Kindly suggest how to fix this issue. Am I doing anything wrong?

RVR

1 Answer


Works for me: perf stat -e cycles:u ls reports 444,022 cycles:u. That's with perf version 5.13.g62fb9874f5da, on Linux 5.12.15-arch1-1, on bare metal (x86-64 Skylake), with perf_event_paranoid=0.
(With modern perf you can also use perf stat --all-user to imply :u for all events.)

I'm guessing your ARM CPU's hardware perf counters don't support being programmed with a mask for privilege-level, so perf reports that there is no hardware counter capable of counting only user-space instructions.

AFAIK, there aren't hooks at every interrupt entry point to enable / disable HW counters; counting only kernel, only user, or both, is purely a hardware feature.

HW support is obviously essential for accurate counts, because in a software implementation the counters would still be counting until kernel code ran that saved the current counts. (The same goes for kernel code that runs after restoring the state, before returning to user-space.) Also, it would make every interrupt and system call even more expensive, instead of only virtualizing perf counters by saving/restoring them on every context switch between tasks/threads. So there are good reasons for the kernel not to support a loose attempt to do it in software, even on CPUs that don't have HW support for a privilege mask.
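
To see which events a given machine rejects, one option is to scan the perf stat report for `<not supported>` lines. This is only a minimal sketch, assuming the output layout shown in the question; the function name `unsupported_events` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads `perf stat` output on stdin and prints the name
# of every event that perf reported as "<not supported>".
# Assumes the layout from the question: "<not supported>   EVENT".
unsupported_events() {
    awk '{
        # Find the two-word marker "<not supported>" and print the field after it.
        for (i = 2; i < NF; i++)
            if ($(i - 1) == "<not" && $i == "supported>")
                print $(i + 1)
    }'
}
```

Usage could look like `perf stat -e cycles:u,instructions:u true 2>&1 | unsupported_events`. The `2>&1` is there because perf stat writes its report to stderr by default. An empty result would suggest the modifiers are accepted on that CPU.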

Peter Cordes
  • Thanks Peter for the explanation. Yes, probably the ARM CPU's hardware doesn't support separating the counters between userspace and kernelspace. What I tried instead was to run perf record -e cycles,instructions and then perf report --stdio --sort comm,dso. I then multiply the "Event count (approx.)" from the perf report by the summed percentages of the libraries running in userspace, and separately by the summed percentage of the kernel module. Would that be indicative of what stat/record would report with the :u/:k modifiers? – RVR Oct 15 '21 at 11:14
  • @RVR: Yeah, `perf record` and looking at the user vs. kernel breakdown of the samples is probably a good alternative, especially for events like cycles and instructions where no hot code can avoid generating counts for them. Even for events that vary more with workload details, it might not be biased much in one way or another. e.g. I don't see obvious mechanisms for the kernel to do something that's going to cause some event, but then have it not actually counted until returning to user-space, at least no more so than user-space doing the same thing. (Maybe like cache lines in/out?) – Peter Cordes Oct 15 '21 at 11:25
  • Many thanks for the explanation. Further, what's with the calls to kernel.kallsyms? I see them listed while running the record command. Would any other system calls from the application/library be listed under kernel.kallsyms? – RVR Oct 15 '21 at 11:47
  • @RVR: IDK, been a while since I looked in detail at a `perf record` output. If you don't find an existing Q&A about it and google comes up empty, you could ask a new question. – Peter Cordes Oct 15 '21 at 11:58
  • Sure thanks, I'll look into it.. – RVR Oct 15 '21 at 14:35
  • Hi Peter. Just want to hijack this thread with a new question. What's the difference between the cpu-clock and cycles events? Is it simply that the former is a software event and the latter is a hardware event? And how are the task-clock and cpu-clock events different? – RVR Oct 22 '21 at 08:31
  • @RVR: `task-clock` is kernel accounting based on context switch times. `cycles` is I think mapped to `cpu_clk_unhalted.thread` on Skylake. Probably `cpu-clock` is mapped to the same event, but you could double-check. – Peter Cordes Oct 22 '21 at 08:37
  • Sure thanks @Peter. – RVR Oct 26 '21 at 05:21
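
The estimate discussed in the comments above (the approximate event count multiplied by the summed per-DSO percentages) can be sketched roughly as follows. This is not an official perf feature, just a hedged illustration. It assumes input shaped like `perf report --stdio --sort comm,dso` for a single event, and it assumes that kernel-side DSOs appear in square brackets (such as `[kernel.kallsyms]` or a module name), with `[vdso]` counted as userspace. The function name `split_user_kernel` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: estimate user vs. kernel event counts from
# `perf report --stdio --sort comm,dso` text on stdin (one event section only).
split_user_kernel() {
    awk '
        # Header line such as "# Event count (approx.): 5362086"
        /^# Event count \(approx\.\):/ { total = $NF; next }
        # Skip other comment lines and anything too short to be a data row.
        /^#/ || NF < 3 { next }
        {
            pct = $1; sub(/%$/, "", pct)   # "50.00%" -> "50.00"
            dso = $NF                      # last column is the DSO
            # Assumption: bracketed DSOs are kernel-side, except [vdso].
            if (dso ~ /^\[.*\]$/ && dso != "[vdso]") kernel += pct
            else user += pct
        }
        END {
            printf "user %.0f\n", total * user / 100
            printf "kernel %.0f\n", total * kernel / 100
        }
    '
}
```

It could be fed with something like `perf report --stdio --sort comm,dso | split_user_kernel`. Because the result comes from samples, it is an approximation of the user/kernel split and not an exact count like the one :u/:k would give on hardware that supports them.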