
This is not a duplicate question. It has been claimed that this question is a duplicate of this one. However, I mentioned neither "Linux" nor "Kernel" (neither in the tags nor in the text), so claiming that this is a duplicate of a question which deals with Linux and perf is wrong.

I'd like to know how to measure interrupt times without external programs. In other words, I'd like to do the time measurement in the code myself, ideally using hardware registers. For the sake of this question, let's suppose that there is no O/S.

Having said this:

In an assembler program which runs on a Pentium-M-like processor, I would like to measure the time a certain procedure needs for execution. This is usually a no-brainer: there are many articles which state and show how to do that, and I also have my own method which works reliably.

However, in this case, there is a problem: The procedure may be interrupted (by hardware interrupts) at any time. Since I'd like to measure the pure execution time of the procedure itself, things are getting more complicated:

  • Measure the whole time the procedure has needed (easy)
  • Measure the time the interrupt handlers have needed while the procedure was running (not that easy)
  • Subtract the interrupt time from the whole time to get the figure I'm interested in

I always thought that on "modern" Intel PC CPUs there is a counter which only counts up while the CPU executes an interrupt handler. But that doesn't seem to be the case. At least, I haven't found it in the "Performance Monitoring" chapter of the Intel 64 and IA-32 Architectures Software Developer's Manual.

I have worked out a solution which fits my needs for the moment, but is not as precise as I'd like it to be for future cases, and it is not very elegant.

Therefore, I'd like to know whether I have missed a hardware counter which could help me by counting only while executing an interrupt handler (or alternatively, counting only when executing code which is not in an interrupt handler).

Disabling interrupts to measure the pure procedure execution time is not an option, because the things which happen in the interrupt handlers may have effects on the execution of the procedure.

The procedure and the interrupt handlers are running on the same core.

The whole code (procedure and interrupt handlers) is running in ring 0.

Binarus
  • If you just want to profile user-space, use a hardware PMU counter programmed to only count user-space, not kernel. (Like `perf stat --all-user` or `perf stat -e cycles:u` instead of `-e cycles`.) There is HW support in the counters for masking by current privilege level. – Peter Cordes Jun 01 '22 at 17:39
  • The CPU doesn't keep track of whether it's "in an interrupt handler" or not. It knows when one starts, but a kernel can go on to run other kernel code like `schedule()` instead of just doing an `iret` back to the same user-space. (And it doesn't have to poke anything in the CPU to say it's leaving an interrupt-handler. It might do something with a logically-separate APIC and/or enable interrupts in the core if they weren't already, but that's separate from the CPU core proper.) – Peter Cordes Jun 01 '22 at 17:41
  • Let the profiled procedure execute many times and hope that at least once it won't be hardware-interrupted. Use the lowest measured time then. – vitsoft Jun 01 '22 at 19:13
  • @PeterCordes Thank you very much for your hints. I should have written that I'd like to look at this question in an O/S-agnostic way, and that I must measure the execution times myself in the code, during production and continuously. Actually, there is no O/S, or more precisely, I use it only to boot the PC and start the application, which then takes complete control over the hardware. – Binarus Jun 02 '22 at 05:50
  • I closed as a duplicate based on it being an X-Y problem, and what you really wanted was to profile something in user-space without counting kernel time spent interrupt handlers. i.e. solving the *or alternatively, counting only when executing code which is not in an interrupt handler* part of the question. (continued...) – Peter Cordes Jun 02 '22 at 05:50
  • ... Since you were already looking at Intel's docs for how to program the PMU, I figured that might be enough of a nudge in the right direction, to look for setting the mask for when to count to only count user-space. Especially since most people use an OS and there was no mention of bare metal here, so for example it's common to use a system call to get the kernel to program the PMU, then use `rdpmc` in user-space or another system call to collect profiling results. Details of that vary by OS, and which API one is using. Or should be documented in the manual if doing it manually. – Peter Cordes Jun 02 '22 at 05:50
  • I was assuming that the code you *do* want to profile runs in ring 3 (CPL=3). If not, the HW support can't help you. Excluding interrupts would also mean excluding kernel time spent in system calls, if the code you're profiling makes system calls. Or for your bare-metal application, if it keeps the CPU in kernel mode full time, then you're out of luck other than having interrupt handlers run extra instructions to record counter values on entry/exit and subtract from the total, or something like that. Please clarify those details, like whether you need to count some CPL=0 time. – Peter Cordes Jun 02 '22 at 05:55
  • @PeterCordes Thank you very much again. To give some more detail, the code runs in ring 0 all the time. The code whose execution time must be monitored is a numerical algorithm whose execution time can't be mathematically (theoretically) proven (at least, we haven't found a proof yet). Plus, there is no chance to test the algorithm with every input parameter combination, and there also is no chance to determine worst-case input parameter combinations with mathematical methods. So we have to monitor that code in the real application, while it runs in production. – Binarus Jun 02 '22 at 06:01
  • Ok, then not a duplicate of [Perf instruction/cycles count in userspace/kernelspace alone in Linux](https://stackoverflow.com/q/69573380), because of the very unusual behaviour of running your number-crunching in ring 0. Also, mathematically proven execution time isn't something you'd ever expect, except maybe within an order of magnitude or two (since cache miss vs. cache hit can make a huge difference, to some algos more than others), so it hardly seems worth mentioning that you couldn't prove it to an accuracy close enough for interrupt handlers to matter. – Peter Cordes Jun 02 '22 at 06:11
  • Thank you very much for reopening. Actually, it's a mixture: It would help a lot if we could mathematically prove how many iterations the algorithm needs to achieve the result with the desired accuracy. If we knew that, we could set up a safety margin for cache misses etc. Apart from that, you're totally right: Originally, we couldn't run that algorithm in the time required. Only after we had applied tricks like dynamically manipulating the stack frame so that certain local variables are always aligned, and after making sure that the algorithm fits in the first-level cache, did it work. – Binarus Jun 02 '22 at 07:14

1 Answer


No, there isn't hardware support for this, only for programming a counter to count in ring 0 (kernel mode) vs. ring 3 (user space). That's what Linux perf uses to implement `perf stat --all-user` or `--all-kernel`, or the `cycles:u` or `cycles:k` event modifiers. (I'm not sure which one ring 1 and ring 2 get lumped in with.)

The x86 ISA doesn't distinguish the state of being in an "interrupt handler" as special. That's merely a software notion, e.g. an interrupt handler in a mainstream kernel might end by jumping to a function called `schedule()` to decide whether to return to the task that was interrupted, or to some other task. There might eventually be an `iret` (interrupt-return), but that's not "special" beyond popping CS:RIP, RSP, and RFLAGS from the current stack, which might be hard to emulate with other instructions.

But if a kernel context-switches to a task that had previously made a blocking system call, it might return to user-space via `sysret`, only running an `iret` much later after context-switching back to a task that got interrupted. You don't need to do anything special to tell an x86 CPU you've finished an interrupt handler (unlike some other ISAs perhaps), so there's nothing the CPU could even watch for.

The APIC (interrupt controller for external interrupts) may need to get poked to let it know that we're ready to handle further interrupts of this type, but the CPU core itself probably doesn't keep track of this.

So there are a few different heuristics one could imagine hypothetical x86 hardware using to tell when an interrupt handler had finished, but I don't think actual x86 hardware PMUs do any of them.


For the normal case of profiling code that runs in user-space (and doesn't make system calls), `perf stat --all-user` (or manually programming the PMU settings it would use) would do exactly what you want, not counting anything while the CPU is in kernel mode. The only kernel time would be in interrupt handlers.

But for your case, where the code you want to profile is running in ring 0, the HW can't help you.

Unless you do extremely time-consuming things in your interrupt handlers (compared to the amount of other work), it's probably good enough to just let them get counted. At least for events like "cycles". If interrupt handlers cause a lot of TLB misses or cache misses or something, that might throw off your counts for other events.

Your interrupt handlers could run `rdpmc` at the start/end and maybe sum up the counts for each event into some global (or core-local) variables, so you'd have something to subtract from your main counts. But that would add the overhead of multiple `rdpmc` instructions to each interrupt handler.

Peter Cordes
  • Thank you very much, accepted and +1. As I've written, I already have a method to achieve my goal, but it is not that elegant and not extremely precise (in that it attributes the time from the HW int to the first instruction in the handler to the normal code instead of to the handler, for example); hence the question. In my case, the handlers may indeed be time-consuming because of hardware accesses, so I have no choice but to take their execution time into account. In summary, I have a solution I can live with, but wanted to know whether I could do it better. Thanks again! – Binarus Jun 03 '22 at 05:55