I want to write a C program that triggers execution of a BPF program whenever a system call is executed on a specific CPU by any process or thread.
The idea is to call perf_event_open(pattr, -1, {MY_CPU_NUM}, -1, 0), followed by ret = ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd);. My BPF program increments a counter in a map, which I then read back.
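glibc does not ship a wrapper for perf_event_open, so the call above has to go through syscall(2). A minimal sketch of such a wrapper, in the style of the one in the perf_event_open(2) man page (the parameter names are only illustrative):

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper around the raw system call.
 * pid = -1 together with cpu >= 0 means "any process/thread on that CPU".
 * Returns a new file descriptor, or -1 with errno set. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```

Note that pid = -1 combined with cpu = -1 is rejected by the kernel, so a system-wide event always needs an explicit CPU number.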
The specific syscall tracepoint I am using in my example is sys_exit_unlinkat, and I test the program with the command taskset --cpu-list {ANY_CPU_OTHER_THAN_MY_CPU_NUMBER} rm -rf {DIRECTORY}.
I expect that if I remove the directory from a different core than the one where I placed my perf event, my counter should not increment. However, the counter increments irrespective of the cpu argument I pass to perf_event_open. I don't understand why!
I tried looking at what perf record -C XX does. It issues a number of perf_event_open calls, including one with PERF_TYPE_TRACEPOINT and arguments similar to mine, and it works correctly, showing output only when rm -rf is executed on MY_CPU_NUM.
Code Snippet:
pattr.type = PERF_TYPE_TRACEPOINT;
pattr.size = sizeof(pattr);
pattr.config = 721; // unlinkat (723 is rmdir)
pattr.sample_period = 1;
pattr.wakeup_events = 1;
pattr.disabled = 1;
pattr.exclude_guest = 1;
pattr.sample_type = PERF_SAMPLE_RAW;
efd = perf_event_open(&pattr, -1, 0, -1, 0); // cpu number is zero
if (efd < 0) {
    printf("error in efd opening, %s\n", strerror(errno));
    exit(1);
}
ret = ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd);
if (ret < 0) {
    printf("PERF_EVENT_IOC_SET_BPF error: %s\n", strerror(errno));
    exit(-1);
}
ret = ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
if (ret < 0) {
    printf("PERF_EVENT_IOC_ENABLE error: %s\n", strerror(errno));
    exit(-1);
}
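The value 721 in pattr.config is a tracepoint id, and such ids can differ between kernel builds. A minimal sketch of reading the id at run time instead of hardcoding it, assuming tracefs is reachable at /sys/kernel/debug/tracing (read_tracepoint_id is a hypothetical helper name, not part of any library):

```c
#include <stdio.h>

/* Hypothetical helper: parses the decimal id stored in a tracefs "id" file.
 * Returns the id (>= 0) on success, or -1 if the file cannot be opened
 * or does not contain a non-negative number. */
long read_tracepoint_id(const char *path)
{
    FILE *f = fopen(path, "r");
    long id = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &id) != 1 || id < 0)
        id = -1;
    fclose(f);
    return id;
}
```

It could then be used as pattr.config = read_tracepoint_id("/sys/kernel/debug/tracing/events/syscalls/sys_exit_unlinkat/id"); after checking that the result is not -1. Reading that file usually requires root.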
Output of uname -a on my machine: Linux zephyr 5.4.0-110-generic
EDIT-1:
Okay, I tried some noob debugging by running the kernel under gdb to figure out the issue.
In the syscall exit path, perf_syscall_exit (kernel/trace/trace_syscalls.c) is called, which then checks whether any perf event is associated with the current CPU.
code snippet:
static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
{
    ...
    syscall_nr = trace_get_syscall_nr(current, regs);
    if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
        return;
    if (!test_bit(syscall_nr, enabled_perf_exit_syscalls))
        return;
    sys_data = syscall_nr_to_meta(syscall_nr);
    if (!sys_data)
        return;
    head = this_cpu_ptr(sys_data->exit_event->perf_events);
    valid_prog_array = bpf_prog_array_valid(sys_data->exit_event);
    if (!valid_prog_array && hlist_empty(head)) // <--- WATCH
        return;
    ...
Now, in the above code, look at the line I marked WATCH. As far as I can tell, it returns early only if there is no valid BPF program array and the event list for the current CPU is empty. So if a valid program is attached but this CPU's event list is empty, the condition is false regardless of whether this CPU has an event attached, and we go ahead executing the BPF program.
I checked this by installing the perf event without attaching a BPF program. In that case the early return was taken when rm -rf {DIRECTORY} was executed from a different CPU. When I executed it from core 0 (where the event was attached), the early return was not taken and the function proceeded.
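To make the reasoning about the check marked WATCH concrete, here is a small user-space model of just that condition. It is only a sketch: returns_early and its parameter names are made up for illustration, and it models nothing beyond the single if statement quoted above.

```c
#include <stdbool.h>

/* Models only the early-return test marked WATCH:
 *   if (!valid_prog_array && hlist_empty(head)) return;
 * valid_prog_array: a BPF program is attached to the tracepoint event.
 * cpu_list_empty:   no perf event is registered for the current CPU. */
bool returns_early(bool valid_prog_array, bool cpu_list_empty)
{
    return !valid_prog_array && cpu_list_empty;
}
```

Under this model, returns_early(false, true) is true (no program, wrong CPU: nothing happens), while returns_early(true, true) is false, which matches what I observe: with a program attached, the function continues even on a CPU that has no event.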
So does that mean that, in the kernel, we cannot attach a BPF program to an event that is tied to a specific CPU? Is this a kernel bug, or a design necessity?