Perf Imprecise Call-Graph Report

Question

Recent Intel processors provide a hardware feature (a.k.a., Precise Event-Based Sampling (PEBS)) to access precise information about the CPU state on some sampled CPU events (e.g., e). Here is an extract from Intel 64 and IA-32 Achitecture's Software Developer's Manual: Volume 3:

18.15.7 Processor Event-Based Sampling (PEBS)

The debug store (DS) mechanism in processors based on Intel NetBurst microarchitecture allow two types of information to be collected for use in debugging and tuning programs: PEBS records and BTS records.

Based on Chapter 17 of the same reference, the DS format for x86-64 architecture is as follows: The BTS Buffer records the last N executed branches (N is dependent on the microarchitecture), while the PEBS Buffer records the following registers: IIUC, a counter is set and each event (e) occurrence increments its value. When the counter overflows, an entry is added to both of these buffers. Finally, when these buffers reach a certain size (BTS Absolute Maximum and PEBS Absolute Maximum), an interrupt is generated and the contents of the two buffers are dumped to disk. This will happen, periodically. It seems that the --call-graph dwarf backtrace data is also extracted in the same handler, Right?

1) Does this mean that LBR and PEBS (--call-graph --lbr) state, perfectly, match together?

2) How about the --call-graph dwarf output, which is not part of PEBS (as seems obvious in the above reference)? (Some RIP/RSPs do not match the backtrace)

Precisely, here is an LKML Thread, where Milian Wolff shows that the second question is, NO. But I do not fully understand the reason?

The answer to the first question is also, NO (expressed by Andi Kleen in the latter messages of the thread), which I do not understand at all.

3) Does this mean that the whole DWARF call-graph information is completely corrupted?

The above thread does not show this, and in my experiments I do not see any RIP not matching the backtrace. In other words, can I trust the majority of the backtraces?

I do not prefer the LBR method which may, itself, be imprecise. It is also limited in the size of the backtrace. Although, here is a patch to overcome the size issue. But this is recent and may be bogus.

UPDATE:

How is it possible to force Perf to store only a single record in PEBS Buffer? Is it only possible to force this configuration, indirectly, e.g., when call-graph information is required for a PEBS event?

BeeOnRope · Accepted Answer · 2022-05-26T18:16:00.233

3

The section of the manual you quoted talks about BTS, not LBR: they are not the same thing. Later in that same thread you quoted Andi Kleen seems to indicate that the LBR snap time is actually the moment of the PMI (the interrupt that runs the handler) and not the PEBS moment. So I think all three stack approaches have the same problem.
DWARF stack captures definitely do not correspond exactly to the PEBS entry. The PEBS event is recorded by the hardware at runtime, and then only some time later is the CPU interrupted, at which point the stack is unwound. If the PEBS buffer is configured to hold only a single entry, these two things should at least be close and if you are lucky, the PEBS IP will be in the same function that is still at the top of the stack when the handler runs. In that case, the stack is basically correct. Since perf shows you the actual PEBS IP at the top, plus the frames below that from the capture, this ends up working in that case.
If you aren't lucky, the function will have changed between the PEBS capture and the handler running. In this case you get a franken-stack that doesn't make sense: the top function may not be callable from the second-from-the-top function (or something). It is not totally corrupted: it's just that everything except the top frame comes from a point after the PEBS stack was captured, and the top frame comes from PEBS, or something like that. This applies also to --call-graph fp, for the same reasons.

Most likely you never saw an invalid IP because perf shows the IP from the PEBS sample (that's the theme of that whole thread). I think if you look into the raw sample, you can see both the PEBS IP, and the handler IP, and you can see they usually won't match.

Overall, you can trust the backtraces for "time" or "cycle" profiling since they are in some sense an accurate sampling representation of execution time: it's just that they don't correspond to the PEBS moment but some time later (but why is that later time any worse than the PEBS time). Basically for this type of profiling you don't really need PEBS at all.

If you are using a different type of event, and you want fine-grained accounting of where the event took place, that's what PEBS is for. You often don't need a stack trace: just the top frame is enough. If you want stack traces, use them, but know they come from a moment in time a bit later, or use --lbr (if that works).

edited May 26 '22 at 18:16

answered Apr 19 '20 at 03:44

BeeOnRope

60,350
16
207
386

Thanks! I checked all `RIP` values with the values at the top of the stack (both extracted from `perf script -D` output. Only *13* out of *488* values do not match and those are for single-line backtraces like what follows, which seems irrelevant to the problem. `evince 8462 11769.772437: 10000 mem_load_uops_retired.l3_miss:uppp: 7fffffffbfa0 0 N/A|SNP N/A|TLB N/A|LCK No ffffffff84601bd0 [unknown] ([unknown]) `. Perhaps I should increase the number of events the see the discrepancy. – TheAhmad Apr 19 '20 at 20:04
@TheAhmad - we should be clear that there are two RIP values: let's call them RIPP (the RIP captured in the PEBS same: basically the "true" RIP of the hardware event), and RIPI (the RIP captured in the subsequent interrupt, which corresponds to the moment the stack was walked). – BeeOnRope Apr 21 '20 at 23:24
1

In that case what is the RIP you are comparing? As I mentioned, it seems like in that thread they decided to show RIPP at the top of the stack, even though RIPI logically belong there, and that they possibly weren't even saving RIPI (because maybe there was only space for one "RIP" in the data capture or some intermediate structure). In that case you won't see any inconsistency between the top of stack and RIPI because one was literally used to populate the other. The inconsistency would be instead between the other frames and the top frame. – BeeOnRope Apr 21 '20 at 23:27
This can also be correct. But there is no certainty. Because, that way they are, intentionally, generating another wrong backtrace. In fact, out of around *250* unique backtraces, I could detect *240* live executions at runtime (The remaining backtraces may have escaped, merely, because of the bugs in my tool). This means that **almost** all (here, *240*) reported backtraces are correct or pollute **only** the IP for the innermost function. – TheAhmad Apr 22 '20 at 00:01
@TheAhmad - about "reported backtraces are correct or pollute only the IP for the innermost function." - but that's exactly what I'm saying! The backtrace is correct in a sense: just from a different moment in time. The only way it will be "corrupted" is when perf decides to show the innermost function with the RIPR not the RIPI, but the rest of the backtrace is still "real". Or perhaps I am not understanding what you are saying. – BeeOnRope Apr 22 '20 at 00:07
To usefully advance this discussion, you should clarify what you mean by "correct" backtrace and "pollute only the IP". – BeeOnRope Apr 22 '20 at 00:11
So, if we accept this, having a broken backtrace is **extremely** rare. Also, if we accept your suggestion, the broken backtrace can only be a backtrace where `RIPP` is not in the function containing `RIPI`. – TheAhmad Apr 22 '20 at 00:17
1

@TheAhmad what is _this_? Just because the RIPP is in the same function as RIPI doesn't mean you can stitch the backtrace together with the RIPP as the top frame and get an accurate backtrace from the moment the HW event: execution may have left the function and then called again the function with a different stack. So you can only guarantee the stiching is correct when a given leaf function can only have one backtrace. – BeeOnRope Apr 22 '20 at 00:25
1

About how rare is for the RIPI and RIPP to be in different functions, I would guess it is _extremely_ common if you have short functions. If your functions are longer (many 10s or 100s of cycles), then it would be less common. Why don't you run some tests with small, trivial functions? – BeeOnRope Apr 22 '20 at 00:26
I missed that case (where we returned from the function where the `PEBS` event had occurred)! But those short functions should be called frequently and also, should not be inlined, and also the outer parts of the stack in that case should be different. The test is a good idea, for detecting how `Perf` works in this situation. – TheAhmad Apr 22 '20 at 00:44
@TheAhmad - yeah, I am curious myself. I was able to give this answer based on what I know, but I am not 100% sure the current state or the behavior. Almost everything I know is from that same thread you found which I participated in when it was happening, but I am not 100% clear on the outcome. I think the situation is basically as I describe it: and as long as the raw record contains both the RIPP and RIPI, I think you could make a different decision on how to display this than `perf report` with a bit of hacking. – BeeOnRope Apr 22 '20 at 16:27
I used a code similar to what is used here: https://easyperf.net/blog/2018/08/29/Understanding-performance-events-skid. I used `PEBS` and all `RIPP`s matched the value at the top of the stack (while a skid was inevitable for the stack trace). Therefore, you are right about the fact that `RIPP` overwrites `RIPI` in the stack report. – TheAhmad Apr 23 '20 at 18:07

Perf Imprecise Call-Graph Report

1 Answers1