Major Perf and PIN profiling discrepancies

Question

To analyze certain attributes of execution time, I planned to use both Perf and PIN in separate runs of a program to gather all of my information. PIN would give me instruction mixes, and Perf would give me hardware performance counts for those mixes. As a sanity check, I profiled the following command:

g++ hello_world.cpp -o hello

So my complete command line inputs were the following:

perf stat -e cycles -e instructions g++ hello_world.cpp -o hello
pin -t icount.so -- g++ hello_world.cpp -o hello

In the PIN command, I omitted the file paths for the sake of this post. I also modified the basic icount.so to record instruction mixes in addition to the default dynamic instruction count. The results were astonishingly different:

PIN Results:
Count 1180608
14->COND_BR: 295371
49->UNCOND_BR: 21869
//skipping all of the other instruction types for now

Perf Results:
       20,538,346 branches                                                    
       105,662,160 instructions              #    0.00  insns per cycle        

       0.072352035 seconds time elapsed

This was supposed to serve as a sanity check: I expected roughly the same instruction counts and roughly the same branch distributions. Why would the dynamic instruction counts be off by a factor of 100?! I was expecting some noise, but that's a bit much.

Also, branches make up about 20% of the instructions for Perf, but PIN reports around 25% (that also seems like a rather wide discrepancy, but it's probably just a side effect of the massive instruction count distortion).

g++ internally starts several programs: the cc1 compiler itself, the as assembler, and the ld linker. Add the `-v` option to g++ to see all subprograms, and try to modify your g++ command to start only a single tool, e.g. `-c` (compiler+assembler) or `-S` (compiler). — osgx, Mar 03 '14 at 00:30

score 1 · Answer 1 · answered Apr 05 '21 at 22:31

There are significant differences between what's counted by the icount pintool and the instructions performance event, which is mapped to the architectural Instructions Retired hardware performance event on modern Intel processors. I assume you're on an Intel processor.

1. pin is only injected into child processes when the -follow_execv command-line option is specified and, if the pintool registered a callback function to intercept process creation, that callback returned true. perf, on the other hand, profiles all child processes by default. You can tell perf to profile only the specified process using the -i option.
2. perf, by default, counts events that occur in both user mode and kernel mode (if /proc/sys/kernel/perf_event_paranoid is smaller than 2). pin only supports profiling in user mode.
3. The icount pintool counts at basic block granularity, where a basic block is essentially a short, single-entry, single-exit sequence of instructions. If an instruction in the block causes an exception, the rest of the instructions in the block will not be executed, but they have already been counted. An exception may be handled without terminating the program. The instructions event, by contrast, only counts instructions at retirement.
4. The icount pintool, by default, counts each iteration of a rep-prefixed instruction as one instruction. The instructions event counts a rep-prefixed instruction as a single instruction irrespective of the number of iterations.
5. On some processors, the instructions event may overcount or undercount.

The instructions event count may be larger due to the first two reasons. The icount pintool instruction count may be larger due to the next two reasons. The last reason may result in unpredictable discrepancies. Since the perf count is about 100x larger than the icount count, it's clear that the first two factors are dominant in this case.
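As a quick check of the "about 100x" figure and of the branch shares, the arithmetic can be redone from the numbers quoted in the question. This is only a sketch: the variable names are mine, and the PIN branch share uses just the two branch classes the question lists, since the other instruction types were skipped there.

```shell
#!/bin/sh
# Counts copied from the outputs quoted in the question.
perf_instructions=105662160
perf_branches=20538346
pin_instructions=1180608
pin_cond_br=295371
pin_uncond_br=21869

# Integer ratio of the two dynamic instruction counts.
ratio=$((perf_instructions / pin_instructions))
echo "perf/pin instruction ratio: about ${ratio}x"

# Branch share in percent; awk is used for the floating-point division.
perf_branch_pct=$(awk -v b="$perf_branches" -v n="$perf_instructions" \
    'BEGIN { printf "%.1f", 100 * b / n }')
pin_branch_pct=$(awk -v b="$((pin_cond_br + pin_uncond_br))" -v n="$pin_instructions" \
    'BEGIN { printf "%.1f", 100 * b / n }')
echo "perf branch share: ${perf_branch_pct}%"
echo "pin branch share (COND_BR + UNCOND_BR only): ${pin_branch_pct}%"
```

The ratio comes out a little under 100x, which is consistent with most of the gap coming from the child processes and kernel-mode execution that only perf sees.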

You can bring the two tools' counts much closer by passing -i to perf so that it does not profile children, adding the :u modifier to the instructions event name to count only in user mode, and passing -reps 1 to pin to count rep-prefixed instructions per instruction rather than per iteration.

perf stat -i -e cycles,instructions:u g++ hello_world.cpp -o hello
pin -t icount.so -reps 1 -- g++ hello_world.cpp -o hello
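Whether the unmodified instructions event would have included kernel-mode counts in the first place depends on the perf_event_paranoid setting mentioned above. A minimal way to check it, assuming a Linux system with /proc mounted; describe_paranoid is a hypothetical helper of mine, not something perf provides, and it only describes the situation for unprivileged users:

```shell
#!/bin/sh
# Hypothetical helper (not part of perf): report whether an unprivileged
# user may count kernel-mode events at the given perf_event_paranoid level.
describe_paranoid() {
    if [ "$1" -lt 2 ]; then
        echo "kernel counting permitted"
    else
        echo "kernel counting not permitted for unprivileged users"
    fi
}

paranoid_file=/proc/sys/kernel/perf_event_paranoid
if [ -r "$paranoid_file" ]; then
    level=$(cat "$paranoid_file")
    echo "perf_event_paranoid=$level: $(describe_paranoid "$level")"
else
    echo "cannot read $paranoid_file"
fi
```

Adding :u makes the comparison with pin independent of this setting, which is why it is worth specifying explicitly.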

Instead of passing -i to perf, you can pass -follow_execv to pin as follows:

pin -follow_execv -t icount.so -reps 1 -- g++ hello_world.cpp -o hello

In this way, both tools will profile the entire process hierarchy rooted at the specified process (i.e., a running g++).

I expect the counts to be very close with these measures, but they still won't be identical.

*The instructions event counts a rep-prefixed instruction as a single instruction irrespective of the number of iterations.* - has anyone tested if a page-fault or timer interrupt during a big `rep movsb` can get it counted more than once? The partial completion does have to "retire". Pretty small difference, still at least a factor of 4096 fewer counts than PIN if it counts each time it partially or fully finishes, but since you're enumerating corner cases... — Peter Cordes, Apr 06 '21 at 03:02
@PeterCordes AFAIK, only [Weaver](http://web.eece.maine.edu/~vweaver/projects/deterministic/ispass2013_deterministic.pdf) has tested the impact of hardware interrupts and page faults on instruction count. According to his results, on SnB/IvB and some older processors, each hardware interrupt and page fault causes an additional instruction to be counted (even in the middle of a rep-prefixed instruction). But I don't know if anyone has tested on more recent processors. Important point though because the question was posted in 2013, so it's very relevant. — Hadi Brais, Apr 06 '21 at 11:37
