Why is this rangev3 implementation of vectors summation slower than the STD equivalent?

Question

I am considering using rangev3 in a library of mine. I like rangev3's syntax, but the priority is performance. The library runs lots of vector multiplications and additions, mostly 128 samples long. I used Google Benchmark to measure, for instance, the addition of two vectors. The ranges version is much slower than the STD version (about 7x slower at 128 samples in the timings below). This is somewhat surprising, as rangev3 (and the future std::ranges in C++20) is often claimed to have good performance.

Is there an issue with how I am using rangev3 here? Or is it something to do with the compiler not being able to unroll rangev3 code well? Or do rangev3's performance gains show up only for many daisy-chained operations?

Notes: the output = rng1; assignment should not allocate memory as the vector length is the same (I tried to use ranges::copy, but it becomes 100 times slower). I tried to pre-initialise and randomise the vectors A and B, but saw no difference. I did notice that if I had more operations in a pipeline, the gap between STL and rangev3 narrowed, but only for long vectors (above 32000 for 5 consecutive operations).

Below is a self-contained example with performance metrics. I am running C++17 LLVM libc++ on a 4-core i7 MacBook Pro with -O3 flag.

#include <algorithm>   // std::transform
#include <functional>  // std::plus
#include <vector>
#include <range/v3/all.hpp>
#include <benchmark/benchmark.h>

static void AddBenchmark(benchmark::State& state) {
  const size_t length = state.range(0);

  std::vector<double> B(length);
  std::vector<double> A(length);
  std::vector<double> output(length);
  
  while (state.KeepRunning()) {
    std::transform(A.begin( ), A.end( ), B.begin( ), output.begin(), std::plus<>( ));
    benchmark::ClobberMemory(); // Force output to be written to memory.
  }
}
BENCHMARK(AddBenchmark)->RangeMultiplier(8)->Range(1<<7, 1<<20);


static void AddRangesBenchmark(benchmark::State& state) {
  const size_t length = state.range(0);

  std::vector<double> B(length);
  std::vector<double> A(length);
  std::vector<double> output(length);
  
  while (state.KeepRunning()) {
    auto rng1 = ranges::view::transform(A, B, std::plus<>( ));
    output = ranges::to<std::vector<double>>(rng1);
    benchmark::ClobberMemory(); // Force output to be written to memory.
  }
}
BENCHMARK(AddRangesBenchmark)->RangeMultiplier(8)->Range(1<<7, 1<<20);

BENCHMARK_MAIN();

which outputs

AddBenchmark/128                 30.3 ns         30.2 ns     23194091
AddBenchmark/512                  121 ns          121 ns      5758094
AddBenchmark/4096                1917 ns         1906 ns       417300
AddBenchmark/32768              25054 ns        24795 ns        28182
AddBenchmark/262144            385913 ns       382803 ns         1718
AddBenchmark/1048576          2100095 ns      2096442 ns          328
AddRangesBenchmark/128            218 ns          218 ns      3131249
AddRangesBenchmark/512            579 ns          579 ns      1169688
AddRangesBenchmark/4096          5071 ns         5069 ns       123231
AddRangesBenchmark/32768        50702 ns        50649 ns        14382
AddRangesBenchmark/262144      482216 ns       481333 ns         1288
AddRangesBenchmark/1048576    3349331 ns      3347475 ns          200

Are you compiling with optimizations enabled? Please post your compilation line — Vittorio Romeo, Jul 07 '19 at 23:11
Running with -O3. I am using Xcode, so I don't really have a compilation line, but if you ask specific options I can give them to you. Thanks! — Enzo, Jul 07 '19 at 23:13
My guess is that ranges, not being a part of the standard library yet, do not benefit from some well-tested performance tweaks that standard algorithms do. One thing I can think of is that `view`s are lazy and thus cannot benefit from vectorised processor instructions, but I can be wrong. EDIT: @alfC - very good call. I suggest you run some tests and probably post an answer, if it turns out that's the bottleneck — Fureeish, Jul 07 '19 at 23:18
@alfC Whoops! Thanks for that. Rerun with doubles, and saw little difference there. Editing question now. — Enzo, Jul 07 '19 at 23:21
@Fureeish thanks for that. That is also a hypothesis of mine. However, the fact that the operation is lazy does not necessarily imply you cannot use vectorised operations. It just means that the optimisation should happen at the line where the operation is actually called (in this case, at the assignment line). — Enzo, Jul 07 '19 at 23:25
I believe it *does* mean that vectorisation cannot be used *trivially*. `view`s are lazy and generate one value per `*` call, which means that values cannot benefit from vectorised operations, since there is at most one value present at a time. On the other hand, I believe that if ranges are accepted, compilers may (and should) be altered to generate efficient code for them. As of right now I think they are simply not smart enough, given the fact that the concept is new to `C++`, which, actually, is quite a shame. By the way, which assignment? I think you may've misunderstood some things. — Fureeish, Jul 07 '19 at 23:28
@Enzo: I think using this test as some kind of benchmark for range views is just not helpful. Range views are not for trivial cases. You're invoking a simple algorithm for each value and storing the results in a container. Views are for composing multiple independent operations to build complex operations *without* having to store the intermediate results in a container. Basically, nobody *should* write the view version of that code, so there's no point in testing it against the regular algorithm version. — Nicol Bolas, Jul 07 '19 at 23:38
@NicolBolas Thanks for the comment and for updating tags. I also had an example with multiple pipelined operations (5 additions and multiplications) and the gap narrowed, but range-v3 was still much slower (2x times) for low vector sizes. I thought of reporting the summation-only self-contained example here for simplicity. — Enzo, Jul 07 '19 at 23:45
@Enzo: If you're not comparing the view version to a non-view version that *has to store* intermediate values in containers, then process those values for the next operation, then you're not using views for their intended purposes. This is doubly so if you're talking about objects with more complexity than `int` (such as strings or other things that allocate memory). — Nicol Bolas, Jul 07 '19 at 23:56
@NicolBolas In the version with multiple operations, I am indeed comparing the view version with a non-view version that stores intermediate values in containers. I tested 5 operations (3 sum and 2 mult), and the performance is still worse for the rangev3 version (up to vector length 32K). — Enzo, Jul 08 '19 at 00:58
@Enzo, did you update the timings? I don't understand how to read them. I don't think the lazy view vs. eager operation was the culprit. — alfC, Jul 08 '19 at 03:56
@Enzo ok, so, at the end they are the same speed. If so, please add a note at the beginning of your question, otherwise it is confusing. — alfC, Jul 10 '19 at 08:26
@alfC I am not sure I understand. By "saw little difference there" I meant after fixing the `int` issue you pointed out. The difference between range-v3 and STD is still there, as you can see from the result. — Enzo, Jul 10 '19 at 09:39
Yep, that's right. That is 5 times slower. At 128 samples it is 7 times slower. You have to imagine that people in signal processing do this type of computation continuously. To give you an idea, getting a 10% improvement in computation time takes days of SSE/AVX/etc programming. If switching to range-v3 means a slow-down of 700% that's definitely not good. — Enzo, Jul 10 '19 at 10:15
`ranges::to<std::vector<double>>(rng1)` creates a temporary vector. As https://godbolt.org/z/1dE8oxrME shows, there are unnecessary memory allocations when you use the function `to`. Please use something like `std::copy` or `ranges::copy` and test the performance again. I don't have Google Benchmark installed. According to https://godbolt.org/z/1dE8oxrME, rangev3 can generate good (but not optimal) instructions. — HarryLeong, Jul 14 '21 at 06:28

score 0 · Answer 1 · answered Nov 11 '20 at 22:31

(Too long for a comment)

When I try to compile this code, I get:

<source>:1468:14: error: no match for 'operator=' (operand types are 'std::vector<double>' and 'ranges::transform2_view<ranges::ref_view<std::vector<double> >, ranges::ref_view<std::vector<double> >, std::plus<void> >')
 1468 |     output = rng1;

and that's a legit error, I think. So, perhaps you mis-copy-pasted? Or do you want to be using ::to<std::vector<double>>() there?

answered Nov 11 '20 at 22:31

einpoklum

Thanks for looking into this. You are right: it seems like my code no longer compiles. It used to at the time I posted, with the compiler I was using at the time. I edited my question with the new line `output = ranges::to<std::vector<double>>(rng1);` Sadly, range-v3 remains significantly slower than the non-range version today. – Enzo Feb 28 '21 at 22:40
