I'm trying to benchmark relatively small portions of a set of larger algorithms that are implemented in C++. Simplifying somewhat, one could say that each algorithm is implemented via two functions (let's call them foo() and bar()) that can be called repeatedly and in arbitrary order, and these functions may make changes to some algorithm-internal data structures. Among other things, I want to compare the algorithms' performances by separately measuring the total time spent in foo() and bar(), respectively.
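To be concrete about how the totals are measured, the pattern is roughly the following (a simplified sketch; the real harness is part of a larger project and all names here are made up):

#include <chrono>

// Accumulated totals for each function (illustrative only):
std::chrono::nanoseconds totalFoo{0}, totalBar{0};

// Times a single call and adds it to the respective total.
template <typename Func>
void addTimeTo(std::chrono::nanoseconds& total, Func&& f) {
    auto start = std::chrono::high_resolution_clock::now();
    f(); // the call to foo() or bar() being measured
    total += std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::high_resolution_clock::now() - start);
}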
Now I have two algorithms: Algorithm A does some work in foo(), but very little in bar(), whereas Algorithm B does absolutely nothing in foo() (foo() is in fact an empty function here), but a lot of work in bar(). The unexpected thing I observed is that in many scenarios the total time Algorithm B spends in foo() is larger than the total time Algorithm A spends in foo(). After some debugging, I found out that for Algorithm B, the first call to foo() after a call to bar() takes a lot of time, whereas subsequent calls to foo() tend to be faster.
In an attempt to pin down this effect, I came up with the following simplification of Algorithm B, which consists of an empty function (corresponding to foo()) and two functions in which I tried to simulate work (corresponding to bar(); the "real" bar() basically also just allocates space and iterates over data structures):
b.h:
#ifndef B_H
#define B_H
void foo_emptyFunction(unsigned long long u); // foo()
void bar_expensiveFunction1(); // bar() - version 1
void bar_expensiveFunction2(); // bar() - version 2
#endif
b.cpp:
#include "b.h"
#include <iostream>
#include <vector>
#include <math.h>
void foo_emptyFunction(unsigned long long )
{
// nothing
}
void bar_expensiveFunction1() {
std::vector<unsigned long> vec;
for (auto i = 0UL; i < 1000000UL; i++) {
vec.push_back(i);
}
std::cout << "Created and filled a vector with " << vec.size() << " elements." << std::endl;
}
void bar_expensiveFunction2() {
std::vector<unsigned long> vec;
for (auto i = 1UL; i <= 1000000UL; i++) {
vec.push_back(i);
}
auto sum = 0ULL;
auto sumSqrts = 0.0;
for (auto i : vec) {
sum += i;
sumSqrts += sqrt(i);
}
std::cout << "Sum of elements from " << vec.front()
<< " to " << vec.back() << " is " << sum
<< ", the sum of their square roots is " << sumSqrts << "." << std::endl;
}
I then try to measure the time needed to call the empty function multiple times after an "expensive" one:
main.cpp:
#include "b.h"
#include <chrono>
#include <thread>
#include <iostream>
#include <math.h>
typedef std::chrono::high_resolution_clock sclock;
typedef unsigned long long time_interval;
typedef std::chrono::duration<time_interval, std::chrono::nanoseconds::period> time_as;
void timeIt() {
auto start = sclock::now();
auto end = start;
for (auto i = 0U; i < 10U; i++) {
start = sclock::now();
asm volatile("" ::: "memory");
foo_emptyFunction(1000ULL);
asm volatile("" ::: "memory");
end = sclock::now();
std::cout << "Call #" << i << " to empty function took " << std::chrono::duration_cast<time_as>(end - start).count() << "ns." << std::endl;
}
}
int main()
{
timeIt();
bar_expensiveFunction1();
timeIt();
std::this_thread::sleep_for(std::chrono::milliseconds(100));
std::cout << "Slept for 100ms." << std::endl;
timeIt();
bar_expensiveFunction2();
timeIt();
bar_expensiveFunction1();
timeIt();
return 0;
}
If I compile (g++ -o test main.cpp b.cpp or also g++ -O3 -o test main.cpp b.cpp) and run the code, I get an output similar to this:
./test
Call #0 to empty function took 79ns.
Call #1 to empty function took 57ns.
Call #2 to empty function took 55ns.
Call #3 to empty function took 31ns.
Call #4 to empty function took 35ns.
Call #5 to empty function took 26ns.
Call #6 to empty function took 26ns.
Call #7 to empty function took 36ns.
Call #8 to empty function took 24ns.
Call #9 to empty function took 26ns.
Created and filled a vector with 1000000 elements.
Call #0 to empty function took 84ns.
Call #1 to empty function took 27ns.
Call #2 to empty function took 28ns.
Call #3 to empty function took 27ns.
Call #4 to empty function took 29ns.
Call #5 to empty function took 27ns.
Call #6 to empty function took 29ns.
Call #7 to empty function took 33ns.
Call #8 to empty function took 28ns.
Call #9 to empty function took 23ns.
Slept for 100ms.
Call #0 to empty function took 238ns.
Call #1 to empty function took 106ns.
Call #2 to empty function took 102ns.
Call #3 to empty function took 118ns.
Call #4 to empty function took 199ns.
Call #5 to empty function took 92ns.
Call #6 to empty function took 216ns.
Call #7 to empty function took 118ns.
Call #8 to empty function took 113ns.
Call #9 to empty function took 107ns.
Sum of elements from 1 to 1000000 is 500000500000, the sum of their square roots is 6.66667e+08.
Call #0 to empty function took 126ns.
Call #1 to empty function took 35ns.
Call #2 to empty function took 31ns.
Call #3 to empty function took 30ns.
Call #4 to empty function took 38ns.
Call #5 to empty function took 54ns.
Call #6 to empty function took 29ns.
Call #7 to empty function took 35ns.
Call #8 to empty function took 30ns.
Call #9 to empty function took 29ns.
Created and filled a vector with 1000000 elements.
Call #0 to empty function took 112ns.
Call #1 to empty function took 23ns.
Call #2 to empty function took 23ns.
Call #3 to empty function took 23ns.
Call #4 to empty function took 23ns.
Call #5 to empty function took 22ns.
Call #6 to empty function took 23ns.
Call #7 to empty function took 23ns.
Call #8 to empty function took 24ns.
Call #9 to empty function took 23ns.
I suspect that the differences in running time, in particular the peak at the first call, may be due to some kind of caching effect, but I'd really like to understand what's going on here.
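On Linux, the caching hypothesis could presumably be checked with hardware performance counters, e.g. perf stat -e cache-references,cache-misses ./test, though that only gives per-run aggregates rather than per-call numbers.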
Edit: The effect I observe here is very similar to the one in the real code. There is almost always a huge peak at the first call and it is rather constant from the third call onwards. The effect is even a bit more pronounced in the real code, I suspect because B::bar() does more work in reality (it traverses a graph and not just a list of integers). Unfortunately, the real code is part of a pretty large project, so I can't post it here. The code I posted above is a pretty heavy simplification of the original, but it seems to show the same effect. In reality, both foo() and bar() are virtual (I know this comes with a time penalty) and in a different compilation unit, so the compiler cannot optimize the function call away. Plus, I also checked the assembler of the real program. I am also aware of the fact that I unavoidably also measure the time for calling now() - but I use the same benchmarking code for Algorithm A (which does at least something in its implementation of foo()) and the total time measured for A::foo() is less...
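For context, the real interface has roughly this shape (an illustrative sketch only; the actual class and method names differ):

// Both algorithms implement a common abstract base, so every call to
// foo()/bar() is a virtual dispatch across compilation unit boundaries.
struct Algorithm {
    virtual ~Algorithm() = default;
    virtual void foo(unsigned long long u) = 0; // A: some work, B: empty
    virtual void bar() = 0;                     // A: very little, B: a lot
};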
The optimization level does not seem to have (much) influence on this effect and I get the same behavior with clang.
Edit 2: I also ran the algorithm benchmarks on a dedicated machine (Linux, only system processes, cpu frequency governor set to performance).
Also, I am aware that usually when you do this kind of micro-benchmarking, you do stuff such as cache-warming and repeating the portion of code you want to benchmark many times. Unfortunately, each call to foo() or bar() potentially modifies the internal data structures, so I can't just repeat them. I'd appreciate any suggestions for improvements.
Thanks!