I am trying to measure the number of clock cycles that two different programming languages take to complete the same operation in the same environment (bare metal, without an OS).

I am currently using the QEMU i386 emulator and reading the time-stamp counter with rdtsc to measure the clock cycles.

/* Return the number of CPU ticks since boot. */
static inline u64 rdtsc(void)
{
    u32 hi, lo;
    // asm("cpuid");
    /* volatile stops the compiler from merging or deleting repeated reads */
    asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return ((u64) lo) | (((u64) hi) << 32);
}

Taking the difference between the rdtsc values read before and after the operation should give the number of clock cycles.

    start_time = rdtsc();
    operation();
    stop_time = rdtsc();
    num_cycles = stop_time-start_time;

But the difference is not constant: even over hundreds of iterations, it varies by a few thousand cycles.

  • Is there a better way of measuring clock cycles?

  • Also, is there a way to pass the CPU frequency to QEMU as an input parameter? Currently I am running

qemu-system-i386 -kernel out.elf

cadaniluk
madstr
  • A thousand cycles is 0.000001 seconds on a 1 GHz CPU. That's just noise. You'll need to increase the number of iterations (e.g. into the millions) so that the noise becomes insignificant. – Ross Ridge Nov 25 '15 at 21:17
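
The suggestion in the comment above can be sketched roughly as follows. This is only an illustration, not code from the question: `bench`, `bench_result`, `fake_ticks` and `fake_op` are hypothetical names, and the fake counter exists only so the sketch builds on a hosted system. On the bare-metal target one would presumably pass the question's `rdtsc` and the real `operation` instead, and replace the `assert` with whatever check is available there.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*tick_fn)(void);  /* any monotonic tick source, e.g. rdtsc */
typedef void (*op_fn)(void);        /* the operation under test */

struct bench_result {
    uint64_t min_ticks;  /* fastest single run, least affected by noise */
    uint64_t avg_ticks;  /* total ticks divided by the iteration count */
};

/* Time `op` `iterations` times and summarise the per-run tick counts. */
static struct bench_result bench(tick_fn now, op_fn op, uint64_t iterations)
{
    struct bench_result r = { UINT64_MAX, 0 };
    uint64_t total = 0;

    assert(iterations > 0);
    for (uint64_t i = 0; i < iterations; i++) {
        uint64_t start = now();
        op();
        uint64_t elapsed = now() - start;

        total += elapsed;
        if (elapsed < r.min_ticks)
            r.min_ticks = elapsed;
    }
    r.avg_ticks = total / iterations;
    return r;
}

/* Stand-ins for demonstration only: a fake counter that advances by
   5 ticks per read and by 100 ticks per operation. */
static uint64_t fake_counter;
static uint64_t fake_ticks(void) { return fake_counter += 5; }
static void fake_op(void) { fake_counter += 100; }
```

Calling `bench(fake_ticks, fake_op, 1000)` exercises the harness without any hardware. Reporting the minimum alongside the average is a design choice: the minimum tends to filter out one-off disturbances, while the average shows how much they contribute overall.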

1 Answer

Trying to benchmark guest software under QEMU emulation is at best extremely difficult. QEMU's emulation does not have performance characteristics that are anything like a real hardware CPU's: some operations that are fast on hardware, like floating point, are very slow on QEMU; we don't model caches and you won't see anything like the performance curves you would see as data sets reach cache line or L1/L2/etc cache size limits; and so on.

Important factors in performance on a modern CPU include (at least):

  • raw instruction counts executed
  • TLB misses
  • branch predictor misses
  • cache misses

QEMU doesn't track any of the last three and only makes a vague attempt at the first one if you use the -icount option. (In particular, without -icount the RDTSC value we provide to the guest under emulation is more-or-less just the host CPU RDTSC value, so times measured with it will include all sorts of QEMU overhead including time spent translating guest code.)
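
To make the point about -icount concrete: going by the QEMU documentation, with a fixed `-icount shift=N` the virtual CPU executes one instruction every 2^N ns of virtual time. The sketch below (the function name `icount_virtual_ns` is made up for illustration, and it assumes a fixed shift rather than `shift=auto`) shows that under this model the elapsed virtual time depends only on how many instructions ran, not on which ones.

```c
#include <assert.h>
#include <stdint.h>

/* Virtual nanoseconds that pass while the guest executes `insns`
   instructions, assuming a fixed -icount shift=N (one instruction
   per 2^N ns of virtual time). */
static uint64_t icount_virtual_ns(uint64_t insns, unsigned shift)
{
    assert(shift < 64);  /* keep the shift well defined */
    return insns << shift;
}
```

Two programs that execute the same number of instructions get the same virtual time under this model, even if one of them would miss the cache constantly on real hardware.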

Assuming you're on an x86 host, you could try the -enable-kvm option to run this under a KVM virtual machine. Then at least you'll be looking at the real performance of a hardware CPU, though you will still see some noise from the overhead as other host processes contend for CPU with the VM.

Peter Maydell
  • Thanks @peter. I have now added both the icount and kvm options. With the icount option, QEMU gives the same clock count for each execution. Since I am comparing the performance of two languages, whatever overhead QEMU adds applies to both. Can I still rely on the results to draw a conclusion? – madstr Jan 20 '16 at 17:50
  • No, because the "cycle counts" reported by QEMU don't bear any interesting relationship to real CPU behaviour. Program A could be faster by count-of-instructions on QEMU than Program B, but run slower on a real CPU. – Peter Maydell Aug 05 '16 at 16:39
  • @PeterMaydell Can you give me an example of how to get the instruction count in QEMU? I ran this command: qemu-system-mips -icount shift=7,rr=record,rrfile=replay.bin -net none -M malta -kernel vmlinux-2.6.32-5-4kc-malta -hda debian_squeeze_mips_standard.qcow2 -append "root=/dev/sda1 console=tty0" -m 400 -redir tcp:8022::22 – Sam May 23 '17 at 16:56