I'm trying to benchmark relatively small portions of a set of larger algorithms that are implemented in C++. Simplifying, one could say that each algorithm is implemented via two functions (let's call them foo() and bar()) that can be called repeatedly and in arbitrary order, and these functions may make changes to some algorithm-internal data structures. Among other things, I want to compare the algorithms' performance by separately measuring the total time spent in foo() and bar(), respectively.

Now I have two algorithms: Algorithm A does some work in foo(), but very little in bar(), whereas Algorithm B does absolutely nothing in foo() (foo() is in fact an empty function here), but a lot of work in bar(). The unexpected thing I observed here is that the total time Algorithm B spends in foo() is in many scenarios larger than the total time Algorithm A spends in foo(). After some debugging, I found out that for Algorithm B, the first call to foo() after a call to bar() takes a lot of time, whereas subsequent calls to foo() tend to be faster.

In an attempt to pin down this effect, I came up with the following simplification of Algorithm B, which consists of an empty function (corresponds to foo()) and two functions where I tried to simulate work (correspond to bar(), the "real" bar() basically also just allocates space and iterates over data structures):

b.h:

#ifndef B_H
#define B_H

void foo_emptyFunction(unsigned long long u); // foo()
void bar_expensiveFunction1();   // bar() - version 1
void bar_expensiveFunction2();   // bar() - version 2

#endif

b.cpp:

#include "b.h"

#include <iostream>
#include <vector>
#include <cmath>


void foo_emptyFunction(unsigned long long )
{
    // nothing
}

void bar_expensiveFunction1() {
  std::vector<unsigned long> vec;
  for (auto i = 0UL; i < 1000000UL; i++) {
      vec.push_back(i);
  }
  std::cout << "Created and filled a vector with " << vec.size() << " elements." << std::endl;
}

void bar_expensiveFunction2() {
  std::vector<unsigned long> vec;
  for (auto i = 1UL; i <= 1000000UL; i++) {
      vec.push_back(i);
  }
  auto sum = 0ULL;
  auto sumSqrts = 0.0;
  for (auto i : vec) {
      sum += i;
      sumSqrts += std::sqrt(i);
  }
  std::cout << "Sum of elements from " << vec.front()
            << " to " << vec.back() << " is " << sum
            << ", the sum of their square roots is " << sumSqrts << "." << std::endl;
}

I then try to measure the time needed to call the empty function multiple times after an "expensive" one:

main.cpp:

#include "b.h"

#include <chrono>
#include <thread>

#include <iostream>

typedef std::chrono::high_resolution_clock sclock;
typedef unsigned long long time_interval;
typedef std::chrono::duration<time_interval, std::chrono::nanoseconds::period> time_as;

void timeIt() {
        auto start = sclock::now();
        auto end = start;

        for (auto i = 0U; i < 10U; i++) {
                start = sclock::now();
                asm volatile("" ::: "memory");
                foo_emptyFunction(1000ULL);
                asm volatile("" ::: "memory");
                end = sclock::now();
                std::cout << "Call #" << i << " to empty function took " << std::chrono::duration_cast<time_as>(end - start).count() << "ns." << std::endl;
        }
}

int main()
{
        timeIt();

        bar_expensiveFunction1();

        timeIt();

        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::cout << "Slept for 100ms." << std::endl;

        timeIt();

        bar_expensiveFunction2();

        timeIt();

        bar_expensiveFunction1();

        timeIt();

        return 0;
}

If I compile (`g++ -o test main.cpp b.cpp`, or with optimization, `g++ -O3 -o test main.cpp b.cpp`) and run the code, I get output similar to this:

./test

Call #0 to empty function took 79ns.
Call #1 to empty function took 57ns.
Call #2 to empty function took 55ns.
Call #3 to empty function took 31ns.
Call #4 to empty function took 35ns.
Call #5 to empty function took 26ns.
Call #6 to empty function took 26ns.
Call #7 to empty function took 36ns.
Call #8 to empty function took 24ns.
Call #9 to empty function took 26ns.
Created and filled a vector with 1000000 elements.
Call #0 to empty function took 84ns.
Call #1 to empty function took 27ns.
Call #2 to empty function took 28ns.
Call #3 to empty function took 27ns.
Call #4 to empty function took 29ns.
Call #5 to empty function took 27ns.
Call #6 to empty function took 29ns.
Call #7 to empty function took 33ns.
Call #8 to empty function took 28ns.
Call #9 to empty function took 23ns.
Slept for 100ms.
Call #0 to empty function took 238ns.
Call #1 to empty function took 106ns.
Call #2 to empty function took 102ns.
Call #3 to empty function took 118ns.
Call #4 to empty function took 199ns.
Call #5 to empty function took 92ns.
Call #6 to empty function took 216ns.
Call #7 to empty function took 118ns.
Call #8 to empty function took 113ns.
Call #9 to empty function took 107ns.
Sum of elements from 1 to 1000000 is 500000500000, the sum of their square roots is 6.66667e+08.
Call #0 to empty function took 126ns.
Call #1 to empty function took 35ns.
Call #2 to empty function took 31ns.
Call #3 to empty function took 30ns.
Call #4 to empty function took 38ns.
Call #5 to empty function took 54ns.
Call #6 to empty function took 29ns.
Call #7 to empty function took 35ns.
Call #8 to empty function took 30ns.
Call #9 to empty function took 29ns.
Created and filled a vector with 1000000 elements.
Call #0 to empty function took 112ns.
Call #1 to empty function took 23ns.
Call #2 to empty function took 23ns.
Call #3 to empty function took 23ns.
Call #4 to empty function took 23ns.
Call #5 to empty function took 22ns.
Call #6 to empty function took 23ns.
Call #7 to empty function took 23ns.
Call #8 to empty function took 24ns.
Call #9 to empty function took 23ns.

I suspect that the differences in running time, in particular the peak at the first call, may be due to some kind of caching effect, but I'd really like to understand what's going on here.

Edit: The effect I observe here is very similar to the one in the real code. There is almost always a huge peak at the first call, and the timings are rather constant from the third call onwards. The effect is even a bit more pronounced in the real code, I suspect because B::bar() does more work in reality (it traverses a graph rather than just a list of integers). Unfortunately, the real code is part of a pretty large project, so I can't post it here. The code posted above is a heavy simplification of the original, but it shows the same effect. In reality, both foo() and bar() are virtual (I know this comes with a time penalty) and live in a different compilation unit, so the compiler cannot optimize the function calls away. I also checked the assembly of the real program. I am aware that I unavoidably also measure the time for calling now() - but I use the same benchmarking code for Algorithm A (which does at least something in its implementation of foo()), and the total time measured for A::foo() is less... The optimization level does not seem to have (much) influence on this effect, and I get the same behavior with clang.

Edit 2: I also ran the algorithm benchmarks on a dedicated machine (Linux, only system processes running, CPU frequency governor set to performance).

Also, I am aware that when doing this kind of micro-benchmarking, one usually does things such as cache warming and repeats the portion of code to be benchmarked many times. Unfortunately, each call to foo() or bar() potentially modifies the internal data structures, so I can't simply repeat them. I'd appreciate any suggestions for improvements.

Thanks!

Qundercut
  • For your next question, please try to keep your names consistent. We've spent time learning the names of your two functions `foo()` and `bar()`, but they're named differently in the code ;) – YSC Jan 25 '19 at 15:09
  • To be clear: Does the code that you show actually exhibit the timing behavior you are speaking of? Or is this just some illustration where the real problem cannot be observed? – Max Langhof Jan 25 '19 at 15:34
  • 1
    Can you add your compile command? – JVApen Jan 25 '19 at 15:36
  • Did you watch the CPU usage while your test is happening? Is there some antivirus/preemptive paging/whatever going on in other processes (including the system)? Did you keep an eye on e.g. CPU frequency, memory frequency and similar? I know this sounds really stupid but I've observed the dumbest things (e.g. throughput getting better by creating a `while (true);` thread beforehand) on some systems... – Max Langhof Jan 25 '19 at 15:42
  • Also note that you are mainly profiling the `now()` function of your clock here. If the linker didn't eliminate the call to `uselessEmptyFunction()` entirely, it at least won't do more than a `mov`, `call` and `ret` for that, which is significantly less than whatever system call `now()` ends up doing. Feel free to look at the assembly to confirm. Or look at the results you get when you remove the call to `uselessEmptyFunction()`. – Max Langhof Jan 25 '19 at 15:54
  • Sorry, @YSC, I thought `emptyFunction()` and `expensiveFunction()` might be more descriptive, but I see your point. I changed them to `foo_emptyFunction()` and `bar_expensiveFunction()` – Qundercut Jan 25 '19 at 18:37
  • It's not a big deal, just an idea to improve your future questions. `xxx_foo()`/`yyy_bar()` is a nice compromise ^^ – YSC Jan 25 '19 at 18:48

1 Answer

I notice that after the sleep the benchmark performs worse. That may be due to the CPU dropping into a lower frequency/power state.

Before benchmarking, fix the CPU frequency to its maximum so that the CPU doesn't adjust it during the benchmark.

On Linux:

$ sudo cpupower --cpu all frequency-set --related --governor performance

On Windows, set the power plan to "High Performance".

Maxim Egorushkin
  • Thanks! Indeed, I forgot to set the cpu frequency governor to performance before I posted the output of the simplified code :( Sorry. That explains the extremely high peak after the sleep. The peak is "normal" (like the other peaks) now after I set the governor to performance. – Qundercut Jan 25 '19 at 18:56
  • Could you clarify whether fixing the CPU frequency resolved your issue? – Maxim Egorushkin Jan 25 '19 at 21:50
  • No, it just makes the peak (Call #0) after the sleep smaller, but it's still there. As are all the others in the simplified code. When running benchmarks for the algorithms, I always had the cpu frequency governor set to performance, so this does not explain the issue, unfortunately. – Qundercut Jan 26 '19 at 21:28