Application callchain incomplete for perf record samples within glibc

Question

I want to get the callchain for my program using the perf tool. But the result is always incomplete, it always lacks the last function which directly invokes usleep. I have tried to record sched:sched_switch, and usleep trace events, but the result is always the same. Below is the command line:

$  g++ -g  -rdynamic test_sleep.cpp -o test_sleep
$ ./test_sleep &
$ ps -ef |grep test_sleep |grep -v grep
root     15292 14879  0 16:31 pts/1    00:00:02 ./test_sleep

$ perf record -e  probe_libc:usleep -e sched:sched_switch  -gp 15292
$ perf script 

test_sleep 15292 [019] 159309.652668:      probe_libc:usleep: (7ff615142030)

                   ee030 usleep (/usr/lib64/libc-2.17.so)
                    1039 funcs1 (/home/test_sleep)
                    1070 fun (/home/test_sleep)
                    107b main (/home/test_sleep)
                   21b15 __libc_start_main (/usr/lib64/libc-2.17.so)

test_sleep 15292 [019] 159309.663728:     sched:sched_switch: test_sleep:15292 [120] S ==> swapper/19:0 [120]

            7fff8168dbb0 __schedule ([kernel.kallsyms])
            7fff8168e069 schedule ([kernel.kallsyms])
            7fff8168cf36 do_nanosleep ([kernel.kallsyms])
            7fff810b786b hrtimer_nanosleep ([kernel.kallsyms])
            7fff810b79ae sys_nanosleep ([kernel.kallsyms])
            7fff81699089 system_call_fastpath ([kernel.kallsyms])
                   bd410 __GI___libc_nanosleep (/usr/lib64/libc-2.17.so)
                    1039 funcs1 (/home/test_sleep)
                    1070 fun (/home/test_sleep)
                    107b main (/home/test_sleep)
                   21b15 __libc_start_main (/usr/lib64/libc-2.17.so)

Below is the code for my program:

void funcs2()
{
    for(i = 0; i< 100; i++){};
    while(1)
        usleep(10000);
}
void funcs1()
{
    for(i = 0; i< 100; i++){};
    funcs2();
}
void fun()
{
    for(i = 0; i< 100; i++){};
    funcs1();
}
main()
{
    fun();
    return 0;
}

We can see funcs2 is missing in the callchain. Is this function optimized by the g++ compiler? But I do not use -O option and I catch the complete callstack using backtrace_symbols in my programs, below is the callstack recorded by backtrace_symbol:

./test_sleep(_Z6funcs2v+0x32) [0x400e4f]
./test_sleep(_Z6funcs1v+0x35) [0x401039]
./test_sleep(_Z3funv+0x35) [0x401070]
./test_sleep(main+0x9) [0x40107b]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7ff615075b15]
./test_sleep() [0x400d59]

What an excellent first question for StackOverflow with a good description and code example (I'll condone the missing `int`s for now ;-)). I took the liberty to rephrase the title a little bit to make it easier to find. — Zulan, Nov 10 '17 at 21:14
Thank you, Zulan. This is the first question for StackOverflow. Next time I will pay attention to the details. — Wind, Nov 11 '17 at 09:33

score 1 · Answer 1 · answered Nov 10 '17 at 21:07

By default perf uses frame-pointers to generate the callchain within userspace. Unfortunately frame-pointers are not reliable due to optimization. Your issue arises because your glibc is compiled with a specific optimization (-fomit-frame-pointer) which inhibits and easy stack unwinding.

Let's see what perf does. The in-kernel callchain is done with a different mechanism

        7fff8168dbb0 __schedule ([kernel.kallsyms])
        7fff8168e069 schedule ([kernel.kallsyms])
        7fff8168cf36 do_nanosleep ([kernel.kallsyms])
        7fff810b786b hrtimer_nanosleep ([kernel.kallsyms])
        7fff810b79ae sys_nanosleep ([kernel.kallsyms])
        7fff81699089 system_call_fastpath ([kernel.kallsyms])

The initial userspace entry uses the current userspace instruction pointer, so perf gets this right

               bd410 __GI___libc_nanosleep (/usr/lib64/libc-2.17.so)

Now perf uses the userspace frame pointer to determine the next function. Since libc didn't update the frame pointer due to the optimization. It gets the frame pointer of funcs2 (instead of __GI___libc_nanosleep). The return address of funcs2 points to funcs2 - so that's what perf sees as next on the stack.

                1039 funcs1 (/home/test_sleep)
                1070 fun (/home/test_sleep)
                107b main (/home/test_sleep)
               21b15 __libc_start_main (/usr/lib64/libc-2.17.so)

So this has nothing to do with the optimization within your program. You're even lucky that libc at least left your frame-pointer intact so you continue to see the stack except for the topmost function. That said, perf provides two other options for generating the callchain. --call-graph=dwarf actually stores a part of the stack and unwinds it with post-processing. On my system this shows funcs2, but nothing beyond that in userspace. There is also --call-graph=lbr which uses some hardware support of Intel CPUs. Unfortunately I can't test it.

This behavior affects any perf samples within libc, be it different events, syscalls or just hardware samples. For reference, see for instance this debian bug report discussing a frame-pointer version of libc. You could of course build your own libc with frame-pointers.

Hi Zulan. What system are you using ? Centos or Ubuntu ? I remember glibc is optimized (-fomit-frame-pointer) by default. You said your result contains the funcs2, and how did you do that ? — Wind, Nov 13 '17 at 02:22

Application callchain incomplete for perf record samples within glibc

1 Answers1