I am trying to benchmark the potential performance increase of emplace_back
vs. push_back
using Google Benchmark:
#include <benchmark/benchmark.h>
#include <vector>
#include <string>
using namespace std;
static void BM_pushback(benchmark::State& state) {
for(auto _ : state) {
vector<string> data;
for(int i=0;i<10000000;++i)
data.push_back("A long string to avoid sbo");
}
}
static void BM_emplaceback(benchmark::State& state) {
for(auto _ : state) {
vector<string> data;
for(int i=0;i<10000000;++i)
data.emplace_back("A long string to avoid sbo");
}
}
BENCHMARK(BM_pushback)->UseRealTime();
BENCHMARK(BM_emplaceback)->UseRealTime();
BENCHMARK_MAIN();
Using g++
with -O3
I get:
Running ./benchmark
Run on (8 X 4200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 8192 KiB (x1)
Load Average: 0.60, 0.48, 0.53
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------
BM_pushback/real_time 886383300 ns 886276679 ns 1
BM_emplaceback/real_time 698194513 ns 698159138 ns 1
If I, however, change the order of execution, that is
BENCHMARK(BM_emplaceback)->UseRealTime();
BENCHMARK(BM_pushback)->UseRealTime();
I receive opposite timings:
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------
BM_emplaceback/real_time 863837627 ns 863814885 ns 1
BM_pushback/real_time 765406073 ns 765407578 ns 1
Can someone explain this behavior? I think, this might have something to do with caching but what exactly is going on here? Just as a note: What seems to help is to reduce the benchmarking size, i.e. adding only 10000
instead of 10000000
string
s to my vector
(which leads to more cycles being performed by the library).