2

How would I go about monitoring a particular process's execution (namely, its branches, from the Branch Trace Store) using the Intel Performance Counter Monitor, while filtering out other processes' information?

BeeOnRope
user541686
  • Vtune has process/thread/image filtering capability and does support PMU. Can't say anything about BTS, but in linux `perf` tool has `perf branch` mode. – osgx May 17 '12 at 23:48
  • @osgx: Ooh you're referring to Intel's VTune... yeah it would probably have the feature, except it's not free. :( I'll look at the trial, but I'm not sure it'll work for me.... (I need it specifically for monitoring Windows apps.) – user541686 May 17 '12 at 23:51
  • Ah, OProfile is Linux-specific. D'oh. – sarnold May 17 '12 at 23:52
  • Mehrdad, do you already have any working code which uses BTS? Can you post it? I'm not sure whether `IntelPerformanceCounterMonitorV2.0.zip` allows you to do this. – osgx May 17 '12 at 23:59
  • @osgx: No I don't have any code for it... for some reason I thought I saw some performance counters (after installing the tool) that indicated the information must have been there, but now that you mention it, I could've been wrong... I'll look into it more, thanks for pointing it out. – user541686 May 18 '12 at 00:09
  • Ok, then can you describe your task (your goals)? If the application is user-space and you don't need BTS info for kernel-mode code, I think you will get BTS for a single thread only (provided the kernel saves and restores the MSR_DEBUGCTLA MSR as a thread-specific register when it reschedules threads). – osgx May 18 '12 at 00:12
  • @osgx: The goal is to do some tracing of a program's execution to figure out the types of jumps involved (indirect, direct, calls vs. jumps, etc.), and information like that... I definitely don't need kernel mode tracing, but I need it to be as fast as possible (the Pin Tool already does the job, but it's very slow and hence sorta unusable). Do you know of any way to get the BTS info? (I assume Windows doesn't have it built-in, so would I need to somehow hook the thread scheduler to get it per-thread??) – user541686 May 18 '12 at 00:14

3 Answers

2

Note that the BTS (Branch Trace Store) and the performance monitoring events/counters (in the CPU's PMU block) are very different things.

The Branch Trace Store is a CPU feature that records every taken branch in a special area of memory. Each record is a pair of EIP values (first the address of the branch instruction, then the address of the branch target) plus a word of flags. The result is much like single-stepping and recording the order in which code blocks (basic blocks) are executed. It is like compiler-assisted code coverage, where the compiler instruments every branch.
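To make the record layout concrete, here is a minimal Python sketch that decodes a raw BTS buffer that has already been copied out of the DS area. It assumes the 64-bit layout as I read it in the Intel manual (three little-endian 8-byte fields per record: source, target, flags) and that bit 4 of the flags word marks a predicted branch; check both against the manual for your CPU. The names `BranchRecord` and `decode_bts` are hypothetical, not part of any tool.

```python
import struct
from typing import Iterator, NamedTuple

# Assumption: 64-bit mode, three little-endian 8-byte fields per record.
BTS_RECORD = struct.Struct("<QQQ")
# Assumption: bit 4 of the flags field marks a branch that was predicted.
PREDICTED_FLAG = 1 << 4


class BranchRecord(NamedTuple):
    source: int      # address of the branch instruction
    target: int      # address the branch transferred control to
    predicted: bool  # decoded from the flags field


def decode_bts(buffer: bytes) -> Iterator[BranchRecord]:
    """Yield one BranchRecord per complete record in a raw BTS buffer."""
    # Ignore a trailing partial record, if any.
    usable = len(buffer) - len(buffer) % BTS_RECORD.size
    for source, target, flags in BTS_RECORD.iter_unpack(buffer[:usable]):
        yield BranchRecord(source, target, bool(flags & PREDICTED_FLAG))
```

Classifying each branch (direct vs. indirect, call vs. jump) would additionally require disassembling the instruction at `source`, which this sketch does not attempt.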

BTS is enabled by a bit in the MSR_DEBUGCTLA MSR (an Intel x86 model-specific register). I'm almost sure that this register is treated as thread-specific (as it is in Linux), so you should not need to hook the scheduler. There are some examples of working with this MSR on Windows, but they use a different bit. Also, don't forget to set DS_AREA correctly. So, if you really want BTS, get a copy of the Intel architecture manual (Volume 3B, part "Debugging and Performance Monitoring", section "19.7.8 Branch Trace Store (BTS)") and program BTS manually. The hardest part is handling DS area overflow (you need a custom interrupt handler).
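As a rough illustration of which bits are involved, the sketch below only computes the value one would write to the debug-control MSR; it does not touch hardware, since the actual write needs ring-0 code (a driver) and a valid DS area. The bit positions are my reading of the manual's IA32_DEBUGCTL description (the newer name for this register) and should be verified for your CPU model. The function name `bts_debugctl` is hypothetical.

```python
# Assumed bit positions, per my reading of the IA32_DEBUGCTL description.
DEBUGCTL_TR = 1 << 6            # enable branch trace messages
DEBUGCTL_BTS = 1 << 7           # store those messages in the BTS buffer
DEBUGCTL_BTINT = 1 << 8         # interrupt at the threshold instead of wrapping
DEBUGCTL_BTS_OFF_OS = 1 << 9    # do not record branches taken in ring 0
DEBUGCTL_BTS_OFF_USR = 1 << 10  # do not record branches taken in user mode


def bts_debugctl(current: int, user_only: bool = True,
                 interrupt_on_threshold: bool = False) -> int:
    """Return the MSR value that turns BTS on, preserving unrelated bits."""
    value = current | DEBUGCTL_TR | DEBUGCTL_BTS
    value &= ~DEBUGCTL_BTS_OFF_USR  # user-mode branches are always wanted here
    if user_only:
        value |= DEBUGCTL_BTS_OFF_OS
    else:
        value &= ~DEBUGCTL_BTS_OFF_OS
    if interrupt_on_threshold:
        value |= DEBUGCTL_BTINT
    else:
        value &= ~DEBUGCTL_BTINT
    return value
```

For a complete trace, `interrupt_on_threshold=True` is the relevant mode: without it the buffer is treated as circular and old records are overwritten.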

If you want statistics about your program rather than a trace of the executed code (how many instructions were executed, how well branches were predicted, how many indirect branches there are, ...), you should use performance monitoring events with event-based sampling, for example "Precise Event Based Sampling" (PEBS). Intel VTune does this; there should be other tools too, including the Intel PCM you linked. The only problem (a bit more difficult with free tools) is finding the names of the events you want. Events based on instruction execution are always bound to some thread.

What event-based sampling means: you set a limit, e.g. 1000, for some event, e.g. BR_INST_EXEC.COND ("number of conditional near branch instructions executed") or BR_INST_EXEC.DIRECT ("all unconditional near branch instructions excluding calls and indirect branches."), with up to 2-4 events at once. The CPU then counts every occurrence of the event. On the 1000th occurrence, an interrupt is generated and the EIP of the instruction is recorded. With sampling it is easy to get detailed statistics about your code's behaviour. If you set the limit to something very low and do not aggregate the events per EIP, you get a trace ;)
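The counting scheme can be illustrated with a toy simulation. This is not PMU code: it is a plain-Python model, under the simplifying assumption that the sample is attributed exactly to the instruction that caused the overflow (real hardware can skid). The function name `sample_events` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def sample_events(event_ips: Iterable[int], period: int) -> Counter:
    """Model a counter that overflows every `period` events.

    `event_ips` is the instruction address of each event occurrence, in
    program order. Returns how many samples landed on each address.
    """
    if period < 1:
        raise ValueError("period must be at least 1")
    samples: Counter = Counter()
    remaining = period
    for ip in event_ips:
        remaining -= 1
        if remaining == 0:    # counter overflow: take a sample here
            samples[ip] += 1
            remaining = period  # re-arm the counter
    return samples
```

With `period=1` every occurrence is sampled, which is the "very low limit" case: the statistics degenerate into a full record of events, at the cost of one interrupt per event.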

With PEBS you can find out how badly your code behaves on the CPU: where the mispredicted branches are located, which instructions wait for data from the cache, etc. There are hundreds of events (Appendix A of Volume 3B).

PS: there is some code for BTS on Windows: http://blog.csdn.net/quincy_hu/article/details/4053163

PPS: there is a shorter overview of PMU programming, covering both PEBS and BTS: software.intel.com/file/30320. It is for Nehalem, but it may still be relevant for Sandy Bridge.

osgx
  • Hmm... I'm not really looking for "statistics" per se, but rather the precise nature of certain branches (for example, do certain ones occur before the other ones? Do they always come in certain kinds of groups? Are there any exceptions? Are the 'weird' calls inside external libraries, or in the program itself? etc.) so it's not just a 'counter' that I need. It's a run-time thing -- compile-time analysis is useless for what I'm doing. What should I look for if I need this info? – user541686 May 18 '12 at 00:29
  • And haha I was looking at that exact page when you posted it, thanks. :) +1 – user541686 May 18 '12 at 00:31
  • I still don't fully understand your tasks/goals. – osgx May 18 '12 at 00:32
  • >_< not sure how to explain it better. Basically, I need all of the branch addresses on a per-thread basis... is that clearer? – user541686 May 18 '12 at 00:33
  • I still can't understand what is your program if you need traces but not sampling. – osgx May 18 '12 at 00:46
  • It's a *complete* trace of the execution... no 'sampling' involved. (Hence why I'm wondering if I would need to hook the scheduler, since I don't want to 'miss' any branches if the buffer overflows.) – user541686 May 18 '12 at 00:56
  • I just tried what [this page](http://www.openrce.org/blog/view/535/Branch_Tracing_with_Intel_MSR_Registers) suggested, but it seems like the MSR gets reset back to 0 every once in a while (after every context switch? I can't tell)... any idea why? – user541686 May 18 '12 at 01:04
  • I have no idea. I'm a Linux programmer, not a Windows one. Please take a look at the "PS": there is a blog post about BTS usage on Windows. – osgx May 18 '12 at 01:23
1

We were forced to build our own instrumenting profiler that reads the MSRs directly to get this information. The Performance Counter Monitor's source code demonstrates how to build a kernel driver that reads them.

Previously we used VTune, but it crashes when run on our app. (When we tried OProfile on the Linux version, it actually crashed the entire kernel and forced us to power-cycle the machine, which was pretty funny.)

Crashworks
  • Did you do BTS or PEBS monitoring? – osgx May 18 '12 at 00:32
  • @osgx In our case we were counting branch mispredicts and their performance costs -- we're instrumented, not sampling, so were getting precise counts of specific code blocks. – Crashworks May 18 '12 at 00:49
  • Ok, so you do precise counting on parts of the code. Why not use PAPI or another API for the PMU counters? – osgx May 18 '12 at 00:51
  • @osgx Intel's libraries are basically a more usable and Windows-friendly alternative to PAPI for reading the same PMCs. PAPI is broken on 64-bit Windows these days. – Crashworks May 18 '12 at 01:13
0

Check out https://github.com/andikleen/pmu-tools/blob/master/toplev.py

Example: `toplev.py -l2 program` measures the whole system at level 2 while `program` is running.

PhilKo
  • Awesome thanks! Do you know if these have any way to attribute the counters to a given thread or process? (Given I imagine that requires saving/restoring the counters on thread context switches.) – user541686 Nov 23 '22 at 16:26