I am considering using range-v3 in a library of mine. I like range-v3's syntax, but performance is the priority. The library performs many vector multiplications and additions, mostly on vectors 128 samples long. I used Google Benchmark to measure, for instance, the addition of two vectors. The ranges version is much slower than the STL version (about 7x slower at length 128). This is somewhat surprising, as range-v3 (and the future std::ranges in C++20) is often claimed to have good performance.
Is there an issue with how I am using range-v3 here? Or is the compiler unable to unroll range-v3 code well? Or do range-v3's performance gains show up only for many daisy-chained operations?
Notes: the output = ranges::to<std::vector<double>>(rng1); assignment should not allocate memory, as the vector length is the same (I tried using ranges::copy instead, but it was 100 times slower). I also tried pre-initialising and randomising the vectors A and B, but saw no difference. I did notice that with more operations in a pipeline the gap between the STL and range-v3 narrowed, but only for long vectors (above 32000 elements for 5 consecutive operations).
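One way to check the no-allocation assumption in the notes is a small standard-library-only sketch. Both helper names below are hypothetical, and a temporary std::vector stands in for the container returned by ranges::to, so this only illustrates the assignment pattern, not range-v3 itself.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: builds the sum in a temporary vector and move-assigns
// it into `out`, the same shape as `output = ranges::to<std::vector<double>>(rng1)`.
// Returns true if `out` still owns the buffer it had before the call.
inline bool reuses_buffer_with_temporary(const std::vector<double>& a,
                                         const std::vector<double>& b,
                                         std::vector<double>& out) {
    assert(a.size() == b.size() && b.size() == out.size());
    const double* before = out.data();
    std::vector<double> tmp(a.size());  // fresh heap allocation
    std::transform(a.begin(), a.end(), b.begin(), tmp.begin(), std::plus<>());
    out = std::move(tmp);               // out takes over tmp's buffer
    return out.data() == before;
}

// Hypothetical helper: writes the sum straight into `out`, as the
// std::transform benchmark does. Returns true if the buffer is unchanged.
inline bool reuses_buffer_in_place(const std::vector<double>& a,
                                   const std::vector<double>& b,
                                   std::vector<double>& out) {
    assert(a.size() == b.size() && b.size() == out.size());
    const double* before = out.data();
    std::transform(a.begin(), a.end(), b.begin(), out.begin(), std::plus<>());
    return out.data() == before;
}
```

For non-empty vectors, the first helper is expected to report a changed buffer and the second an unchanged one. That suggests the move-assignment form pays for one allocation and one deallocation per iteration, though whether this accounts for the measured gap is not established here.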
Below is a self-contained example with performance metrics. I am compiling as C++17 with LLVM and libc++, using the -O3 flag, on a 4-core i7 MacBook Pro.
    #include <algorithm>   // std::transform
    #include <cstddef>     // size_t
    #include <functional>  // std::plus
    #include <vector>

    #include <range/v3/all.hpp>
    #include <benchmark/benchmark.h>

    // Baseline: element-wise addition with std::transform, writing into a
    // preallocated output vector.
    static void AddBenchmark(benchmark::State& state) {
        const size_t length = state.range(0);
        std::vector<double> B(length);
        std::vector<double> A(length);
        std::vector<double> output(length);
        while (state.KeepRunning()) {
            std::transform(A.begin(), A.end(), B.begin(), output.begin(), std::plus<>());
            benchmark::ClobberMemory(); // Force output to be written to memory.
        }
    }
    BENCHMARK(AddBenchmark)->RangeMultiplier(8)->Range(1<<7, 1<<20);

    // range-v3 version: lazy binary transform view, materialised with
    // ranges::to and assigned to output.
    static void AddRangesBenchmark(benchmark::State& state) {
        const size_t length = state.range(0);
        std::vector<double> B(length);
        std::vector<double> A(length);
        std::vector<double> output(length);
        while (state.KeepRunning()) {
            auto rng1 = ranges::view::transform(A, B, std::plus<>());
            output = ranges::to<std::vector<double>>(rng1);
            benchmark::ClobberMemory(); // Force output to be written to memory.
        }
    }
    BENCHMARK(AddRangesBenchmark)->RangeMultiplier(8)->Range(1<<7, 1<<20);

    BENCHMARK_MAIN();
This outputs:
    Benchmark                          Time           CPU   Iterations
    AddBenchmark/128                30.3 ns       30.2 ns     23194091
    AddBenchmark/512                 121 ns        121 ns      5758094
    AddBenchmark/4096               1917 ns       1906 ns       417300
    AddBenchmark/32768             25054 ns      24795 ns        28182
    AddBenchmark/262144           385913 ns     382803 ns         1718
    AddBenchmark/1048576         2100095 ns    2096442 ns          328
    AddRangesBenchmark/128           218 ns        218 ns      3131249
    AddRangesBenchmark/512           579 ns        579 ns      1169688
    AddRangesBenchmark/4096         5071 ns       5069 ns       123231
    AddRangesBenchmark/32768       50702 ns      50649 ns        14382
    AddRangesBenchmark/262144     482216 ns     481333 ns         1288
    AddRangesBenchmark/1048576   3349331 ns    3347475 ns          200
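To make the gap easier to read, the wall-clock times above can be turned into per-length slowdown ratios (ranges time divided by STL time). This is only a minimal sketch: Row, benchmark_rows and slowdown are hypothetical names, and the numbers are copied from the first time column of the output.

```cpp
#include <cstddef>
#include <vector>

// One benchmark length with the two measured wall-clock times in nanoseconds.
struct Row {
    std::size_t length;
    double stl_ns;     // AddBenchmark time
    double ranges_ns;  // AddRangesBenchmark time
};

// Times copied from the benchmark output above.
inline std::vector<Row> benchmark_rows() {
    return {
        {128, 30.3, 218.0},
        {512, 121.0, 579.0},
        {4096, 1917.0, 5071.0},
        {32768, 25054.0, 50702.0},
        {262144, 385913.0, 482216.0},
        {1048576, 2100095.0, 3349331.0},
    };
}

// How many times slower the ranges version is than the STL version.
inline double slowdown(const Row& r) {
    return r.ranges_ns / r.stl_ns;
}
```

With these figures the ratio is about 7.2 at length 128, falls to about 2.0 at 32768 and about 1.25 at 262144, then rises to about 1.6 at 1048576.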