
I am tracing various kernel functions and system calls, and establishing patterns between them that can be used for performance profiling.

One thing I noticed is that sometimes, even in my simple test application, which spins up a few threads that play with some mutexes, I wouldn't get any calls to kretprobe__sys_futex, though I would get plenty to kprobe__sys_futex.

I assumed this was because, e.g., a thread was calling into sys_futex and going to sleep or perhaps terminating, but I actually see the same processes calling sys_futex multiple times consecutively without the return probe ever firing.

I then assumed the issue was in how I was filtering calls to kprobe__sys_futex, so I made a minimal example using BCC/eBPF to test this:

#! /usr/bin/env python

from bcc import BPF

b = BPF(text="""
BPF_HASH(call_count, int, int);

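// bcc auto-attaches functions named kprobe__<sym> and kretprobe__<sym> to the kernel symbol <sym>.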
int kprobe__sys_futex() {
  int zero = 0;
  call_count.lookup_or_init(&zero, &zero);
  bpf_trace_printk("futex start\\n");
  call_count.increment(zero);
  return 0;
}

int kretprobe__sys_futex() {
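  // Decrement on return; whatever remains is the number of entries that have not yet seen a matching return.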
  int zero = 0;
  int *val_p = call_count.lookup(&zero);
  if (val_p != NULL) {
      int val = *val_p;
      val--;
      call_count.update(&zero, &val);
      bpf_trace_printk("futex calls with no return: %d\\n", val);
  } else { bpf_trace_printk("unexpected futex return\\n"); }
  return 0;
}
""")

b.trace_print()

I notice that with all sorts of applications (a good example is mysql-server, which performs regular futex operations even when idle, at least on my machine), many (often 10+) "futex start" messages are printed before a single message from the return probe.

Here's an example trace of the above program which I left running for a couple of minutes while I wrote this post:

... hundreds of lines of much the same as below
           gdbus-612   [001] .... 211229.997665: 0x00000001: futex start
  NetworkManager-541   [001] .... 211229.997667: 0x00000001: futex start
           gdbus-612   [001] .... 211229.997670: 0x00000001: futex start
          mysqld-697   [001] .... 211230.789205: 0x00000001: futex start
          mysqld-697   [001] .... 211230.789227: 0x00000001: futex start
          mysqld-703   [001] .... 211230.789251: 0x00000001: futex start
          mysqld-703   [001] .... 211230.789253: 0x00000001: futex start
          mysqld-704   [001] d... 211230.789258: 0x00000001: futex calls with no return: 3994
          mysqld-704   [001] .... 211230.789259: 0x00000001: futex start
          mysqld-704   [001] d... 211230.789260: 0x00000001: futex calls with no return: 3994
          mysqld-704   [001] .... 211230.789272: 0x00000001: futex start
          mysqld-713   [000] .... 211231.037016: 0x00000001: futex start
          mysqld-713   [000] .... 211231.037036: 0x00000001: futex start
         vmstats-895   [000] .... 211231.464867: 0x00000001: futex start
          mysqld-697   [001] .... 211231.790738: 0x00000001: futex start
          mysqld-697   [001] .... 211231.790784: 0x00000001: futex start
          mysqld-703   [001] .... 211231.790796: 0x00000001: futex start
          mysqld-703   [001] .... 211231.790799: 0x00000001: futex start
          mysqld-704   [001] d... 211231.790809: 0x00000001: futex calls with no return: 4001
          mysqld-704   [001] .... 211231.790812: 0x00000001: futex start
          mysqld-704   [001] d... 211231.790814: 0x00000001: futex calls with no return: 4001

As you can see, pid 697, for example, seems to have called into sys_futex four times without returning in just this small trace.

I don't think this is a race condition in the eBPF code: if you mute the print statements and only print the count periodically, it usually stays within a few of zero for sys_write, which is called orders of magnitude more often than sys_futex (at least under my system's workload), so I would expect any race condition to be exacerbated there, not hidden.
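For reference, the periodic variant I mention is roughly this sketch (not my exact code; it reuses the call_count map from the script above, drops the bpf_trace_printk calls, and polls the map from user space instead of calling b.trace_print()):

import time

while True:
    time.sleep(5)
    for k, v in b["call_count"].items():
        # Entries minus returns seen so far: stays within a few of zero for
        # sys_write, but keeps growing for sys_futex.
        print("key %d: %d calls with no return so far" % (k.value, v.value))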

I'm running kernel 4.15.0-43-generic on Ubuntu 18.04 LTS, inside a VirtualBox VM.

Happy to provide any more context that might be useful!

There's a relevant thread on the IOVisor mailing list: https://lists.iovisor.org/g/iovisor-dev/topic/29702757

  • If you have the equivalent kernel module code that doesn't have the same issue, it would help to debug! – pchaigno Feb 10 '19 at 19:02
  • Sorry! - I'm writing a reply to your message on the mailing list right now; I was expecting to send it a while ago but I found the kernel module was dropping calls sometimes too. – Ashley Davies Feb 10 '19 at 19:25

1 Answer


It's a known limitation of bcc (cf. iovisor/bcc#1072). Basically, the maximum number of active probes is set too low for your tracing context and you are thus missing some return probes.

In bcc, the maxactive value (the maximum number of concurrently active return-probe instances, see the documentation extract below) is left to its default value. Since Alban Crequy's patch to the Linux kernel (cf. iovisor/bcc#1072), it's possible to change the maxactive value when attaching probes through debugfs. That new API hasn't been exposed through bcc yet, though. I'll try to send a patch to that effect this week.

While the probed function is executing, its return address is stored in an object of type kretprobe_instance. Before calling register_kretprobe(), the user sets the maxactive field of the kretprobe struct to specify how many instances of the specified function can be probed simultaneously. register_kretprobe() pre-allocates the indicated number of kretprobe_instance objects.

For example, if the function is non-recursive and is called with a spinlock held, maxactive = 1 should be enough. If the function is non-recursive and can never relinquish the CPU (e.g., via a semaphore or preemption), NR_CPUS should be enough. If maxactive <= 0, it is set to a default value. If CONFIG_PREEMPT is enabled, the default is max(10, 2*NR_CPUS). Otherwise, the default is NR_CPUS.
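In the meantime, on kernels that include Alban's patch (your 4.15 does), you can create the return probe yourself through the kprobe_events interface, where the number right after the r is maxactive. A rough sketch (run as root; the group/event names and the value 128 are only examples), together with what I'd expect the bcc side to eventually look like:

# "r<maxactive>:<group>/<event> <symbol>" registers a kretprobe with that many
# kretprobe_instance slots. The names below are made up.
with open("/sys/kernel/debug/tracing/kprobe_events", "a") as f:
    f.write("r128:mygroup/futex_ret sys_futex\n")

# Hypothetical bcc usage once such a parameter is exposed (the maxactive keyword
# argument is my assumption, not a released API at the time of writing):
# b.attach_kretprobe(event="sys_futex", fn_name="do_ret_sys_futex", maxactive=128)

Note that creating the event through kprobe_events only registers the ftrace kretprobe; wiring your BPF program up to it is exactly the part the bcc patch needs to handle.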

  • I wasn't entirely convinced that this fully explains why I regularly have runs that seem to trigger next to no retprobes, while others trace them reliably, but I just realised it's likely because a number of threads in the system sleep on a futex for seconds at a time, repeatedly, without doing anything else (looking at you, mysql), and each of them permanently consumes one of the probe instances. That also explains why I could only observe this issue with sys_futex (sys_write returned too fast on my system for this to become an issue). Thanks again for your time and effort helping with this! – Ashley Davies Feb 17 '19 at 16:55