
I used the following command to sample backtraces for an ffmpeg benchmark:

sudo perf record -d --call-graph dwarf,65528 -c 1000000 -e mem_load_uops_retired.l3_miss:u ffmpeg -i /media/ahmad/DATA/Videos/video.mp4 -threads 1 -vf spp out.mp4

As can be seen, PEBS is not used, the stack dump size is set to the maximum, and the sampling period is quite large. I also limited ffmpeg to a single thread. Even so, this is the first part of the perf script --no-demangle output:

ffmpeg 11750  6670.061261:    1000000 mem_load_uops_retired.l3_miss:u:                0         5080021 N/A|SNP N/A|TLB N/A|LCK N/A
        7fffeab68844 x264_pixel_avg_w16_avx2+0x4 (/usr/lib/x86_64-linux-gnu/libx264.so.152)

ffmpeg 11750  6670.274835:    1000000 mem_load_uops_retired.l3_miss:u:                0         5080021 N/A|SNP N/A|TLB N/A|LCK N/A
        7fffeab68844 x264_pixel_avg_w16_avx2+0x4 (/usr/lib/x86_64-linux-gnu/libx264.so.152)

ffmpeg 11750  6670.496159:    1000000 mem_load_uops_retired.l3_miss:u:                0         5080021 N/A|SNP N/A|TLB N/A|LCK N/A
        7fffeab8ef89 x264_pixel_sad_x4_16x16_avx2+0x49 (/usr/lib/x86_64-linux-gnu/libx264.so.152)

ffmpeg 11750  6670.852598:    1000000 mem_load_uops_retired.l3_miss:u:                0         5080021 N/A|SNP N/A|TLB N/A|LCK N/A
        7fffeaac97b3 pixel_memset+0x293 (inlined)
        7fffeaac97b3 plane_expand_border+0x293 (inlined)
        7fffeaac97b3 x264_frame_expand_border_filtered+0x293 (/usr/lib/x86_64-linux-gnu/libx264.so.152)
        7fffeab463bc x264_fdec_filter_row+0x69c (/usr/lib/x86_64-linux-gnu/libx264.so.152)
        7fffeab49523 x264_slice_write+0x1873 (/usr/lib/x86_64-linux-gnu/libx264.so.152)
        7fffeab85285 x264_stack_align+0x15 (/usr/lib/x86_64-linux-gnu/libx264.so.152)
        7fffeab45bdb x264_slices_write+0xfb (/usr/lib/x86_64-linux-gnu/libx264.so.152)
        5555561e3d87 [unknown] ([heap])

ffmpeg 11750  6671.110007:    1000000 mem_load_uops_retired.l3_miss:u:                0         5080021 N/A|SNP N/A|TLB N/A|LCK N/A
        7fffeab6cdde x264_frame_init_lowres_core_avx2+0x8e (/usr/lib/x86_64-linux-gnu/libx264.so.152)

ffmpeg 11750  6671.463562:    1000000 mem_load_uops_retired.l3_miss:u:                0         5080021 N/A|SNP N/A|TLB N/A|LCK N/A
        7fffeaabf806 x264_macroblock_load_pic_pointers+0x886 (inlined)
        7fffeaabf806 x264_macroblock_cache_load+0x886 (inlined)
        7fffeaabf806 x264_macroblock_cache_load_progressive+0x886 (/usr/lib/x86_64-linux-gnu/libx264.so.152)
        7fffeab49204 x264_slice_write+0x1554 (/usr/lib/x86_64-linux-gnu/libx264.so.152)
        7fffeab85285 x264_stack_align+0x15 (/usr/lib/x86_64-linux-gnu/libx264.so.152)
        7fffeab45bdb x264_slices_write+0xfb (/usr/lib/x86_64-linux-gnu/libx264.so.152)
                  1c [unknown] ([unknown])

None of the backtraces is correct, because none of them goes all the way back to _start or __GI___clone. I also tried LBR call graphs instead, but LBR has tighter size constraints and is therefore not suitable. Any suggestions on how to get around the problem?
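The completeness criterion above can be checked mechanically. The following is a minimal Python sketch, not part of the original measurements. It assumes the text layout shown above: samples separated by blank lines, one header line per sample, then one frame per line with the innermost frame first. The function names and the `ROOT_SYMBOLS` set are illustrative choices, and the embedded sample is copied from the output above.

```python
import re

# Root frames expected at the outermost end of a complete user backtrace.
# Assumption: these two names are taken from the question; adjust as needed.
ROOT_SYMBOLS = {"_start", "__GI___clone"}


def parse_samples(text):
    """Return one list of frame symbols (innermost first) per sample."""
    samples = []
    # Assumption: samples are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if not lines:
            continue
        frames = []
        # lines[0] is the sample header; the rest are "<addr> <sym>+<off> (<dso>)".
        for line in lines[1:]:
            fields = line.split()
            if len(fields) >= 2:
                # Drop the "+0x..." offset, keep the bare symbol name.
                frames.append(fields[1].rsplit("+0x", 1)[0])
        samples.append(frames)
    return samples


def count_complete(text):
    """Return (complete, total): samples whose outermost frame is a root symbol."""
    samples = parse_samples(text)
    complete = sum(1 for frames in samples if frames and frames[-1] in ROOT_SYMBOLS)
    return complete, len(samples)


# Two samples copied from the perf script output above.
SAMPLE = """
ffmpeg 11750  6670.061261:    1000000 mem_load_uops_retired.l3_miss:u:                0         5080021 N/A|SNP N/A|TLB N/A|LCK N/A
        7fffeab68844 x264_pixel_avg_w16_avx2+0x4 (/usr/lib/x86_64-linux-gnu/libx264.so.152)

ffmpeg 11750  6671.463562:    1000000 mem_load_uops_retired.l3_miss:u:                0         5080021 N/A|SNP N/A|TLB N/A|LCK N/A
        7fffeaabf806 x264_macroblock_cache_load_progressive+0x886 (/usr/lib/x86_64-linux-gnu/libx264.so.152)
        7fffeab49204 x264_slice_write+0x1554 (/usr/lib/x86_64-linux-gnu/libx264.so.152)
                  1c [unknown] ([unknown])
"""

complete, total = count_complete(SAMPLE)
print(f"{complete} of {total} backtraces reach a root frame")
```

Passing the full perf script --no-demangle text to `count_complete` instead of `SAMPLE` gives the fraction of truncated backtraces per run, which makes it easier to compare events (for example cycles against mem_load_uops_retired.l3_miss).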


UPDATE:

The problem occurs for every event I checked. With mem_load_uops_retired.l3_miss or LLC-load-misses, it was visible from the very first samples. With the cycles event, the backtraces were correct at first, but the same problem appeared later in the output.

Also note that the problem disappears when I sample only kernel-mode mem_load_uops_retired.l3_miss events.

TheAhmad
  • What is your CPU model? – osgx Jun 12 '20 at 00:12
  • `Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz`. I also checked with explicitly calling `perf_event_open()`, but nothing changed. Currently, I'm trying to find the kernel source code that dumps *user callchain instruction pointers*. It seems that `perf_output_sample()` should print *callchain samples*, here: https://github.com/torvalds/linux/blob/master/kernel/events/core.c#L6786. But I cannot view the contents, yet. – TheAhmad Jun 12 '20 at 02:36
  • Sorry, @osgx. Do you know the kernel mechanism for extracting userspace callchains? Am I on the right track? – TheAhmad Jun 14 '20 at 16:11
  • `perf script -D` will dump some raw data from perf.data. perf_overflow_sample_ustack() https://elixir.bootlin.com/linux/v5.7/source/kernel/events/core.c#L6407 does user stack sampling by copying top of stack from user thread into perf ring buffer. Not sure that l3_miss event can generate precise overflow, but cpu core events like cycles should. – osgx Jun 15 '20 at 17:15

0 Answers