Measure CPU speed by counting assembly instructions

Question

Edit: My original example had a silly mistake. After fixing it I still get weird results, though.

In my naive attempt to measure my CPU speed the "brute-force" way, I made the program below:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#pragma comment(linker, "/entry:mainCRTStartup")
#pragma comment(linker, "/Subsystem:Console")

int mainCRTStartup()
{
    char buf[20];
    clock_t start, elapsed;
    unsigned long count = 0;
    start = clock();
    __asm
    {
        mov EAX, 0;
    _loop:
        add EAX, 3; // accounts for itself and next 2 instructions
        cmp EAX, 0xFFFFFFFF - 0x400;
        jb _loop;
        mov count, EAX;
    }
    elapsed = clock() - start;
    _gcvt(count * (long long)CLOCKS_PER_SEC / (elapsed * 1000000000.0), 3, buf);
    puts(buf);
}

Which disassembles into something like:

mainCRTStartup:
  push   ebp
  mov    ebp,esp
  sub    esp,28h
  mov    dword ptr [count],0
  call   dword ptr [_clock]
  mov    dword ptr [start],eax
  mov    eax,0

_loop:
  add    eax,03h
  cmp    eax,0FFFFFBFFh
  jb     _loop

  mov    dword ptr [count],eax
  call   dword ptr [_clock]
  sub    eax,dword ptr [start]

  ...    // call _gcvt, _puts, etc.

  mov    esp,ebp
  pop    ebp
  ret

Notice that the loop is 3 instructions, so the final value of eax should be the total number of instructions.

Why do I get 4.2 when I run this?

@Lie Ryan: how would that explain why the program over-estimates his clock frequency? — Justin M. Keyes, Aug 10 '11 at 03:49

Justin M. Keyes · Accepted Answer · 2011-08-10T14:22:19.087

12

Because instruction-level parallelism and superscalar architecture allow multiple instructions to execute in a single pipelined clock cycle.

For example, in your code, branch prediction effectively eliminates the cmp instruction for all but the last _loop iteration, by:

executing cmp and jb in parallel, and
always taking the jb branch.

Of course, (2) is thrown out on the last iteration, which causes the pipeline to be cleared. The extra ~20 cycles (for a 20-stage pipeline) are negligible since your loop is on the order of 10^9 instructions.

the compiler shouldn't be optimizing this

The processor hardware is always looking for optimization opportunities in the datapath; compilers just try to organize instructions to exploit a given architecture's patterns. E.g., hardware pipelining can increase IPC without software pipelining, especially for relatively hazard-free code such as your example.

edited Aug 10 '11 at 14:22

answered Aug 10 '11 at 03:47

Justin M. Keyes

6,679
3
33
60

**I don't think this has to do with pipelining**, since pipelining doesn't change the long-term rate. But I think I see what you mean now regarding superscalar architectures... so a 2 GHz CPU might actually execute 4 giga-instructions per second (assuming the memory keeps up)? – user541686 Aug 10 '11 at 04:11
For pipelining yes, but combined with redundant functional elements (i.e., superscalar architecture), and spatial locality, a loop like the one in your code is easily optimized by the compiler to execute many times faster than 1 IPS. – Justin M. Keyes Aug 10 '11 at 04:14
Well the compiler shouldn't be optimizing this (I didn't tell it to, and it doesn't look like it did), but I think your reasoning still holds because this is already kind of optimized, +1; thanks! – user541686 Aug 10 '11 at 04:15
Justin: "5-stage pipeline could theoretically (roughly...) give you a 5x speedup [which] would look like 13.8 GHz to your naive program" This is utter nonsense. You seem to confuse superscalar execution with pipelining. A 5-stage pipelining will cause each instruction to be executed for at least 5 clocks (in small portions) together with 4 other instructions, which are all in different states of execution. This leads to 1 instruction/cycle in the optimal case. – Gunther Piez Aug 10 '11 at 09:30
@drhirsch: I updated my answer significantly... let me know what you think. – Justin M. Keyes Aug 10 '11 at 14:23

AShelly · Answer 2 · 2011-08-10T05:13:21.400

9

Because CPU speed is not measured in bytes per sec, but instruction cycles per sec, and especially on x86, some instructions take more than 1 cycle.

See this page for instruction timing. (Actually, this only goes up to 486 - still looking for a good reference for modern processors).

edited Aug 10 '11 at 05:13

answered Aug 10 '11 at 03:42

AShelly

34,686
15
91
152

WOWWWWW I can't believe I totally forgot about that (part 1, not part 2 -- the latter would underestimate, not overestimate), even though it's blindingly obvious!! That was a really stupid mistake. Changing 3 to 2 gives me a *much* more accurate speed. Thanks for the answer! – user541686 Aug 10 '11 at 03:58
1

Instructions taking *more* than 1 cycle would result in his program under-estimating his clock frequency--but instead it is over-estimating. – Justin M. Keyes Aug 10 '11 at 03:58
@AShelly: Actually, I'm going to *unaccept* this because of something I'll post in a few minutes -- it's still weird. – user541686 Aug 10 '11 at 04:02
@AShelly: Your link is broken, btw. – user541686 Aug 10 '11 at 04:12

score 2 · Answer 3 · answered Aug 10 '11 at 03:49

How many cycles an instruction takes to execute has no direct relation to it's size in bytes. In addition with the modern CPU features like multiple execution units and speculative execution it's not really possible to determine before hand how long a given chunk of code will take to execute with much accuracy.

Olof Forshell · Answer 4 · 2011-08-11T08:08:04.393

You could measure the loop using the rdtsc instruction which counts the cycles of a CPU's internal frequency. The difference between two readings is the number of cycles elapsed. Let your code execute 1000 loops,multiply by three (instructions in a loop) and divide by the elapsed cycles. This will give you instructions per cycle. Which you then can scale to your CPU's own frequency.

Keep in mind that since your code is so short it will most likely execute from cache level 1 (or inside the prefetcher?) which makes it valid for that case only and not for the CPU in general. It may be too short for the pipelining to make something worthwhile out of.

As to instruction timing this page appears more up to date than the one AShelly suggested. It is regularly recomputed by Torbjörn Granlund of the Swedish Royal Institute of Technology.

Measure CPU speed by counting assembly instructions

4 Answers4