
I'm trying to determine, more or less accurately, how many CPU cycles a function executes for in a program running under QEMU (x86_64), with the -enable-kvm flag set, if that matters.

Following the instructions in this Intel white paper, it seems that in order to get the most accurate readings I need to use some combination of the rdtsc, rdtscp and cpuid instructions.
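
In case it helps, this is roughly the pattern the white paper describes, as I understand it: serialize with cpuid and read the TSC with rdtsc before the code under test, then rdtscp followed by cpuid after it. A minimal sketch with GCC-style inline asm; measured_work is just a placeholder for the function I actually want to time:

```c
#include <stdint.h>
#include <stdio.h>

/* Placeholder for the function being measured. */
static void measured_work(void) { /* ... */ }

int main(void)
{
    uint32_t lo_s, hi_s, lo_e, hi_e;

    /* Start of timed region: cpuid serializes, then rdtsc reads the TSC
       into EDX:EAX. */
    __asm__ volatile ("cpuid\n\t"
                      "rdtsc"
                      : "=a" (lo_s), "=d" (hi_s)
                      :
                      : "rbx", "rcx", "memory");

    measured_work();

    /* End of timed region: rdtscp waits for the preceding instructions to
       finish before reading the TSC; the trailing cpuid keeps later code
       from being reordered into the timed region. */
    __asm__ volatile ("rdtscp\n\t"
                      "mov %%eax, %0\n\t"
                      "mov %%edx, %1\n\t"
                      "cpuid"
                      : "=r" (lo_e), "=r" (hi_e)
                      :
                      : "rax", "rbx", "rcx", "rdx", "memory");

    uint64_t start = ((uint64_t)hi_s << 32) | lo_s;
    uint64_t end   = ((uint64_t)hi_e << 32) | lo_e;
    printf("elapsed TSC ticks: %llu\n", (unsigned long long)(end - start));
    return 0;
}
```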

There are two issues with this:

  1. On my host machine (also x86_64) the rdtscp instruction is supported, but under QEMU it's not. I haven't been able to find any information on this: is this feature generally absent under QEMU? (A quick way to check from inside the guest is sketched right after this list.)

  2. It seems that both rdtsc and rdtscp might cause VM exits, interfering with the accuracy of my measurements. How can I tell whether this is the case, and is there a way I can prevent it?
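
For issue 1, here is a quick check that can be run inside the guest. This is only a sketch: it relies on RDTSCP support being reported in CPUID leaf 0x80000001, EDX bit 27, and uses GCC/Clang's <cpuid.h> helper.

```c
#include <stdio.h>
#include <cpuid.h>   /* __get_cpuid(), GCC/Clang only */

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* RDTSCP support is reported in CPUID leaf 0x80000001, EDX bit 27.
       If the guest hides it, a fuller guest CPU model (e.g. QEMU's
       -cpu host) typically passes the host feature through. */
    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (edx & (1u << 27)))
        puts("rdtscp available in this (virtual) CPU");
    else
        puts("rdtscp not advertised by CPUID");
    return 0;
}
```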

  • Are you sure you only want to count reference cycles (fixed frequency), not actual core clock cycles? And yes, if your VM is set to vmexit on rdtsc, that adds a huge amount of measurement overhead. Avoid VMs if you can when microbenchmarking, otherwise design your microbenchmark repeat count to be high enough that timing overhead is still negligible. e.g. a repeat loop. But that doesn't work well if you need to measure a function in situ, with cache / branch predictors in the state the surrounding code left them in, not in an artificial repeat loop. – Peter Cordes Jun 19 '21 at 15:48
  • Most VMs should have a way to set the guest CPU feature level to pass through more host CPU features, probably including `rdtscp` as well as stuff like AVX. – Peter Cordes Jun 19 '21 at 15:49
  • Note that cpuid is guaranteed to always cause a VM exit. Use `lfence` instead to serialize just instruction execution if you still want to use `rdtsc`. IDK if qemu + kvm has support yet for HW perf counters (`perf record`, or manual `rdpmc` with the right setup by the kernel), so lfence + rdtsc might be the best you can do in a VM (see the sketch after these comments). – Peter Cordes Jun 19 '21 at 15:51
  • @PeterCordes: Wouldn't reference cycles make more sense (because of reproducibility)? Also, does the cpuid overhead matter? I would have thought that would elapse before and after the calls to rdtsc/rdtscp respectively, and as such not affect the result. – Peter Jun 19 '21 at 15:56
  • A VM exit will disturb cache and branch prediction state from the code that runs in the hypervisor. More importantly, unless you can actually use `rdtscp`, the bottom of your timed region should use `lfence` + `rdtsc` in that order, to make sure every instruction in the timed region has retired before `rdtsc` is allowed to execute. Using CPUID would put a VM exit *inside* the timed region. – Peter Cordes Jun 19 '21 at 16:00
  • Actual cycles are usually *more* reproducible, for code that doesn't bottleneck on memory at least. e.g. a loop that bottlenecks on a dependency chain 4 cycles long (e.g. an FP FMA or ADDPS on Skylake) and runs for 10k iterations, will take about 40k core clock cycles regardless of the clock frequency. But if L2 cache misses are involved, then L3 / DRAM latency in core clocks (i.e. how much of a stall OoO exec has to hide) depends on the clock speed. So generally you want to do warm-up runs to get the CPU up to a consistent speed for serious benchmarking. – Peter Cordes Jun 19 '21 at 16:03
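
Following the comments above, a sketch of the lfence + rdtsc variant using GCC/Clang intrinsics from <x86intrin.h> (again, measured_work is a placeholder, and rdtsc counts fixed-frequency reference ticks rather than core clock cycles):

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc(), _mm_lfence() */

/* Placeholder for the function under test. */
static void measured_work(void) { /* ... */ }

int main(void)
{
    /* lfence before rdtsc: wait for earlier instructions to finish.
       lfence after rdtsc: keep the timed work from starting before
       the timestamp is taken. */
    _mm_lfence();
    uint64_t start = __rdtsc();
    _mm_lfence();

    measured_work();

    /* lfence again so the whole timed region has finished executing
       before the second rdtsc. */
    _mm_lfence();
    uint64_t end = __rdtsc();

    printf("elapsed TSC ticks: %llu\n", (unsigned long long)(end - start));
    return 0;
}
```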

0 Answers