
I am trying to benchmark the potential performance increase of emplace_back vs. push_back using Google Benchmark:

#include <benchmark/benchmark.h>

#include <vector>
#include <string>
using namespace std;

static void BM_pushback(benchmark::State& state) {

    for(auto _ : state) {
        vector<string> data;
        for(int i=0;i<10000000;++i)
            data.push_back("A long string to avoid sbo"); 
    }
}

static void BM_emplaceback(benchmark::State& state) {

    for(auto _ : state) {
        vector<string> data;
        for(int i=0;i<10000000;++i)
            data.emplace_back("A long string to avoid sbo");
    }
}

BENCHMARK(BM_pushback)->UseRealTime();
BENCHMARK(BM_emplaceback)->UseRealTime();

BENCHMARK_MAIN();

Using g++ with -O3 I get:

Running ./benchmark
Run on (8 X 4200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.60, 0.48, 0.53
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_pushback/real_time     886383300 ns    886276679 ns            1
BM_emplaceback/real_time  698194513 ns    698159138 ns            1

If I, however, change the order of execution, that is

BENCHMARK(BM_emplaceback)->UseRealTime();
BENCHMARK(BM_pushback)->UseRealTime();

I receive opposite timings:

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_emplaceback/real_time  863837627 ns    863814885 ns            1
BM_pushback/real_time     765406073 ns    765407578 ns            1

Can someone explain this behavior? I think this might have something to do with caching, but what exactly is going on here? As a note: reducing the benchmark size, i.e. adding only 10000 instead of 10000000 strings to the vector, seems to help (it also makes the benchmark library run each case for more iterations).

Urwald
    On `-O3`, your two inner loops will most likely be optimized to the point where they produce the same assembly, since the data you're adding is compile-time constant. Also, the resulting vector is unused, so it may get optimized away entirely; `benchmark::DoNotOptimize` could help with that. – perivesta Aug 26 '22 at 10:27
  • You are almost certainly seeing the effect of malloc caching allocations. Try calling `malloc_trim(0)` from `<malloc.h>` in between benchmarks – Homer512 Aug 26 '22 at 10:31
  • @Homer512 I tried your solution and I added a call to `malloc_trim(0)` at the beginning of the second function to be called, leading to almost identical run times. Thanks a lot for the great suggestion! – Urwald Aug 26 '22 at 10:40
    @perivesta No they won't. If you find a compiler that is sufficiently magical to perform these optimizations, let me know. But since the string constructor is not a fully inlined function, compilers are not able to do the kind of optimization that you propose. Feel free to check with Godbolt. https://godbolt.org/z/PKTzEc1YG – Homer512 Aug 26 '22 at 10:40
  • @perivesta The optimization does actually work with `std::unique_ptr`, but even `std::vector` with `reserve(n)` is too complicated for compilers – Homer512 Aug 26 '22 at 11:06

0 Answers