
I have four test functions: foo1(), foo2(), foo3() and foo4(). For the measurements I use the following program:

#include <intrin.h>   // __rdtsc (MSVC)
#include <iostream>
using namespace std;

unsigned __int64 start;
unsigned __int64 stop;
unsigned __int64 sum;
unsigned __int64 orig;

sum = 0;
for (int i = 0; i < 10000; i++)
{
    start = __rdtsc();
    foo1();
    stop = __rdtsc();
    sum += (stop - start);
}
orig = sum;   // foo1's total is the baseline for the ratios below
cout << "foo1() \taverage: " << (sum / 10000.0) << ", \tratio: " << ((double)orig / sum) << endl << endl;

sum = 0;
for (int i = 0; i < 10000; i++)
{
    start = __rdtsc();
    foo2();
    stop = __rdtsc();
    sum += (stop - start);
}
cout << "foo2() \taverage: " << (sum / 10000.0) << ", \tratio: " << ((double)orig / sum) << endl << endl;

And so on, for foo3() and foo4().
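
The four loops differ only in the function being called; purely as a sketch, they could be factored into a helper template like this (the measure name is arbitrary, not something in the actual program):

#include <intrin.h>   // __rdtsc (MSVC)

// Times f over `iterations` calls and returns the total cycle count.
template <typename F>
unsigned __int64 measure(F f, int iterations = 10000)
{
    unsigned __int64 total = 0;
    for (int i = 0; i < iterations; i++)
    {
        unsigned __int64 start = __rdtsc();
        f();
        unsigned __int64 stop = __rdtsc();
        total += (stop - start);
    }
    return total;
}

// e.g. orig = measure(foo1); then sum = measure(foo2); etc.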

I get the following log on the console:

foo1()      average: 401495,        ratio: 1
foo2()      average: 24251.2,       ratio: 16.5557
foo3()      average: 11497.7,       ratio: 34.9195
foo4()      average: 7439.06,       ratio: 53.9713

Does this mean that foo4() is ~50 times faster (in real time) than foo1()?

OR does this mean that foo4() is DEFINITELY better in performance than foo1()?

    When you benchmark, do you make sure that CPU frequency stays constant? – Maxim Egorushkin May 05 '20 at 09:32
  • What would be the result if you executed the `foo` functions in an interleaved way (i.e. in the same loop)? I also advise you to throw away the first iterations to reduce possible biases (e.g. the impact of the memory/cache latency on the first call). Moreover, computing the standard deviation of the timings (and not just the mean) could help you to track possible benchmarking/performance issues and could also provide hints on whether you can safely compare the averages or not. – Jérôme Richard May 05 '20 at 11:00

1 Answer


"faster (in real time)" and "better in performance" are equivalent in a single-threaded, uncontended context, provided the measurement is correct.

From the looks of it, a 50x speedup measurement is a sure sign that the function is faster, and thus better, than the baseline.

BUT before you conclude:

  1. Wrap your code into another outer loop, and loop for at least a couple of hundred milliseconds, throwing out the results of the first 200 ms, then take the average of the remaining measurements. This is especially important if the functions under test access memory. Memory caching effects can account for 100x+ elapsed time difference.

  2. Add a call to _mm_lfence() after each foo invocation to ensure all of its instructions have retired before taking the clock measurement. (Both suggestions are combined in the sketch below.)
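
A minimal sketch combining both points, assuming MSVC intrinsics (the 400 ms budget, the "drop the first half" approximation of the 200 ms warm-up cutoff, and the measure_avg name are illustrative choices, not tested code):

#include <intrin.h>   // __rdtsc, _mm_lfence (MSVC)
#include <chrono>
#include <vector>

// Runs f repeatedly for ~400 ms, fencing after every call, then averages
// only the second half of the samples (approximates discarding the
// first ~200 ms of warm-up).
template <typename F>
double measure_avg(F f)
{
    using clock = std::chrono::steady_clock;
    std::vector<unsigned __int64> samples;
    auto t0 = clock::now();
    while (clock::now() - t0 < std::chrono::milliseconds(400))
    {
        unsigned __int64 start = __rdtsc();
        f();
        _mm_lfence();   // keep the stop timestamp after foo's work
        unsigned __int64 stop = __rdtsc();
        samples.push_back(stop - start);
    }
    size_t skip = samples.size() / 2;   // drop the warm-up half
    unsigned __int64 total = 0;
    for (size_t i = skip; i < samples.size(); i++)
        total += samples[i];
    return (double)total / (samples.size() - skip);
}

The averages produced this way can then be compared directly, e.g. measure_avg(foo1) / measure_avg(foo4).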

Microbenchmarking is hard. In practice it means that some functions can be meaningfully measured only in combination, and that the amount of data processed should be roughly equivalent to a real-life scenario.

rustyx