0

I would like to find out how to trace rescheduling interrupts.

As I noticed that my application had some involuntary context switches and then cat /proc/interrupts shows that rescheduling interrupts happened.

I suspected that it has nothing to do with my application and so I created a dummy application:

#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cstdint>

int main(int argc, char** argv) {
        srand(time(nullptr));
        int32_t i = 0;
        while (i != rand()) i = rand() * -1;
        printf("%d", i);
}

Which basically never exits.

I compiled it: /opt/rh/devtoolset-8/root/usr/bin/g++ -Wall dummy.cpp -o a -march=native -mtune=native -O0 -fno-omit-frame-pointer -std=c++17

Then, I ran taskset -c 5 ./a and then cat /proc/interrupts every second or so. I notice that Rescheduling Interrupts increases by 1-2 every second, which I don't expect. Local Timer Interrupt also increases by 1 every second, which is expected. I already isolated core 5 in boot parameter.

Another machine's rescheduling interrupts only increases by 1 every 30 mins or so.

Hence, I am looking for a generic way to trace down this kind of interrupt issues so that I can reapply the same methodology in the future for different kinds of unexpected interrupts.

My kernel version:

# uname -a
Linux localhost 3.10.0-1062.1.2.el7.x86_64 #1 SMP Mon Sep 30 14:19:46 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

I also have tried taskset -c 4 perf record -e cycles:pp -C 5 -g --freq=12000 to record the call graph but for some reason the call graph of the kernel functions wasn't created...only userspace's.

Update 1: Followed @osgx's suggestion, I ran

taskset -c 4 perf record -e irq_vectors:reschedule_entry -c 1 --call-graph dwarf,4096 taskset -c 5 ./a

And then perf report --call-graph

-  100.00%   100.00%  vector=253                                                                                                                                                                ▒
   - 76.47% _start                                                                                                                                                                              ▒
        __libc_start_main                                                                                                                                                                       ▒
      - main                                                                                                                                                                                    ▒
         - 64.71% rand                                                                                                                                                                          ▒
            - 58.82% __random                                                                                                                                                                   ▒
               - 29.41% 0xffffffffb298e06a                                                                                                                                                      ▒
                    0xffffffffb2991864                                                                                                                                                          ▒
                    0xffffffffb22a4565                                                                                                                                                          ◆
                    0xffffffffb222f675                                                                                                                                                          ▒
                  - 0xffffffffb299042c                                                                                                                                                          ▒
                     - 23.53% 0xffffffffb22a41e5                                                                                                                                                ▒
                          0xffffffffb298f8da                                                                                                                                                    ▒
                          0xffffffffb2258ec2                                                                                                                                                    ▒
                     - 5.88% 0xffffffffb298f8da                                                                                                                                                 

kernel-debuginfo, kernel-debuginfo-common, glibc-debuginfo and glibc-debuginfo-common have been installed and -fno-omit-frame-pointer have been specified when compiling. Not sure why addresses show in the report instead of symbols. Idea?

HCSF
  • 2,387
  • 1
  • 14
  • 40
  • 1
    What's your `/proc/sys/kernel/perf_event_paranoid` set to? You might not have permission to backtrace the kernel stack. – Peter Cordes Mar 07 '20 at 04:24
  • @PeterCordes It is set to 2. But I was profiling as root. So shouldn't matter? – HCSF Mar 07 '20 at 14:03
  • 1
    Check value of /proc/sys/kernel/kptr_restrict too. To get some info about resched ipi, try to trace the `irq_vectors:reschedule_entry` tracepoint (if supported in your kernel; `perf list tracepoints|grep resched`) - https://elixir.bootlin.com/linux/v4.19/source/arch/x86/kernel/smp.c#L260; there is also unanswered https://stackoverflow.com/questions/29736822. Andikleen has two more tracepoints https://github.com/andikleen/pmu-tools/blob/master/linux_metrics.py#L86 – osgx Mar 07 '20 at 17:27
  • The kernel does try to defend itself from root to some degree, with strict settings. I would have suggested looking up what the paranoid level numbers mean, but more likely osgx's suggestion of `kernel/kptr_restrict` is the missing piece. – Peter Cordes Mar 08 '20 at 03:36
  • 1
    There seems no perf event or tracepoint for `smp_send_reschedule` function, check my update to https://stackoverflow.com/questions/29736822. You can try to [define `perf probe`](http://man7.org/linux/man-pages/man1/perf-probe.1.html) to it or to native_ variant of function, or to apic send_IPI. Or use some ftrace/`trace-cmd` magic to trace it. – osgx Mar 08 '20 at 04:21
  • @osgx `/proc/sys/kernel/kptr_restrict` is set to 0. According to [this](https://www.kernel.org/doc/Documentation/sysctl/kernel.txt), it seems kernel pointers will be revealed. `perf list tracepoints` doesn't have `resched`. CentOS 7's kernel 3.10 hasn't ported that tracepoint yet I guess. Let me install `kernel-debuginfo` and try out `perf probe`. I have disabled `apic` in boot parameter (before disabling, there was 100+ rescheduling interrupt/sec)...so I guess I don't have to probe apic send_IPI. – HCSF Mar 08 '20 at 15:56
  • @PeterCordes according to [this](https://www.kernel.org/doc/Documentation/sysctl/kernel.txt), it seems like the value of `perf_event_paranoid` doesn't matter if the user has `CAP_SYS_ADMIN`, which `root` has. – HCSF Mar 08 '20 at 16:04
  • 1
    HCSF, sorry, there was a typo and correct command is `perf list tracepoint|grep resched` – osgx Mar 08 '20 at 17:30
  • @osgx nice catch. I just checked again and actually the tracepoints are there: `irq_vectors:reschedule_entry [Tracepoint event]` `irq_vectors:reschedule_exit [Tracepoint event]`. – HCSF Mar 09 '20 at 03:49
  • @osgx I tried `taskset -c 4 perf record -e irq_vectors:reschedule_entry -C 5 -g --freq=12000` and `taskset -c 4 perf record -e irq_vectors:reschedule_exit -C 5 -g --freq=12000`. Both didn't see any rescheduling interrupt. Tho, `cat /proc/interrupts` clearly shows that `Rescheduling Interrupts` are increasing when my application is running on core 5. – HCSF Mar 09 '20 at 12:42
  • 2
    Try `perf stat -e irq_vectors:reschedule_entry` (with -a and -A, probably with -I 1000 to get 1 second prints); and `perf record -e irq_vectors:reschedule_entry -c 1` – osgx Mar 10 '20 at 03:32
  • @osgx both `taskset -c 4 perf stat -e irq_vectors:reschedule_entry -a -A -I 1000 taskset -c 5 ./dummy` and `taskset -c 4 perf record -e irq_vectors:reschedule_entry -c 1 -g taskset -c 5 ./dummy` are able to capture `reschedule_entry events`. Seems like `-C 5` doesn't work. – HCSF Mar 10 '20 at 07:58
  • @osgx `kernel-debuginfo`, `kernel-debuginfo-common`, `glibc-debuginfo` and `glibc-debuginfo-common` have been installed and `-fno-omit-frame-pointer` have been specified when compiling. Not sure why addresses show in the report instead of symbols. Any idea? – HCSF Mar 11 '20 at 12:36
  • `-fno-omit-frame-pointer` when compiling what? Your kernel? Or just the dummy test program? – Peter Cordes Mar 11 '20 at 18:58
  • @PeterCordes dummy test program. I am using the kernel from CentOS 7's repos. – HCSF Mar 12 '20 at 00:43
  • 1
    If you want backtraces in the kernel to work, you need to use something that works for the way the kernel was compiled. The kernel build will have used `-fomit-frame-pointer`. If you don't need backtraces there, just user-space after reaching `main`, then that's not an issue. – Peter Cordes Mar 12 '20 at 02:13
  • @PeterCordes I see. But after building a kernel with `-fomit-frame-pointer`, then it isn't the same kernel in a sense that the actual kernel I want to find out what causes the rescheduling interrupts. – HCSF Mar 12 '20 at 02:49
  • Right, of course not. So you'd have to build *and boot* a kernel with `-fno-omit-frame-pointer` (you typed it without the `no` in your last comment), or you have to use a perf call-graph method other than frame-pointer chains (e.g. dwarf CFI `.eh_frame` metadata) if you want kernel call-graphs to work. (Unless perf can use separate methods for user-space vs. kernel stack snapshots? If it can even do kernel stack snapshots at all; I wouldn't be surprised if that was disallowed for security reasons!) – Peter Cordes Mar 12 '20 at 02:53
  • @PeterCordes when I tried to run `perf record`, I requested with `--call-graph dwarf,4096`. Not sufficient? – HCSF Mar 12 '20 at 03:06
  • That's probably fine, IDK I haven't used stack captures much. I usually tune hot loops where I already know where the hot-spot is so I haven't had much need for that part of perf. I'm just saying that compiling with `-fno-omit-frame-pointer` is pretty pointless when you're using DWARF backtraces, since you made a point of saying you were doing that. – Peter Cordes Mar 12 '20 at 03:08
  • @osgx I also tried your suggestion by using `perf probe`: `perf probe --add native_smp_send_reschedule` and then `taskset -c 4 perf record -e probe:native_smp_send_reschedule -g --callgraph dwarf -c 1 taskset -c 5 ./a`. Interestingly, `perf report` states `Samples: 13 of event 'probe:native_smp_send_reschedule'` but it doesn't show even the name of the probed function. Idea? – HCSF Mar 13 '20 at 12:20

0 Answers0