
I understand that the C++ Core Guidelines recommend returning a std::vector by value (so that RVO/NRVO or move semantics can take place) rather than filling an output parameter passed by reference. When I tested this with the benchmark code below, however, the pass-by-reference function appears to be much faster than the function that returns by value. Why is my PassByReferenceMultiply function so much faster than my RVOMultiply function?

I am using clang 5.0.2.

My compile line is `clang++ -std=c++17 RVO_PassByReference.cpp -o RVO_PassByReference -O3 -march=native`

#include <vector>
#include <chrono>
#include <iostream>

using namespace std;
using namespace std::chrono;

vector<double> RVOMultiply(const vector<double>& v1, const vector<double>& v2)
{
    std::vector<double> ResultVector;
    ResultVector.reserve(v1.size());
    for (size_t i {0}; i < v1.size(); ++i)
    {
        ResultVector.emplace_back(v1[i] * v2[i]);
    }
    return ResultVector;
}

void PassByReferenceMultiply(const vector<double>& v1, const vector<double>& v2, vector<double>& Result)
{
    for (size_t i {0}; i < Result.size(); ++i)
    {
        Result[i] = v1[i] * v2[i];
    }
}

int main ()
{

    vector<double> ReferenceVector(10000);
    vector<double> Operand1Vector(10000);
    vector<double> Operand2Vector(10000);

    for (size_t i {0}; i < Operand1Vector.size(); ++i)
    {
        Operand1Vector[i] = i;
        Operand2Vector[i] = i+1;
    }

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration1 = duration_cast<nanoseconds>(t2 - t1).count();
    auto duration2 = duration_cast<nanoseconds>(t2 - t1).count();


    for (double z {0}; z < 100000; ++z)
    {
        t1 = high_resolution_clock::now();
        vector<double> RVOVector = RVOMultiply(Operand1Vector, Operand2Vector);
        t2 = high_resolution_clock::now();
        if (z != 99999)
            vector<double>().swap(RVOVector);

        duration1 += duration_cast<nanoseconds>(t2 - t1).count();


        t1 = high_resolution_clock::now();
        PassByReferenceMultiply(Operand1Vector, Operand2Vector, ReferenceVector);
        t2 = high_resolution_clock::now();
        duration2 += duration_cast<nanoseconds>(t2 - t1).count();

    }

    duration1 /= 100000;
    duration2 /= 100000;

    cout << "RVOVector Duration Average was: " << duration1 << endl;
    cout << "ReferenceVector push_back Duration Average was: " << duration2 << endl;

}

My output on my system is

RVOVector Duration Average was: 11901
ReferenceVector push_back Duration Average was: 3634

  • Lot of dynamic allocation in `ResultVector.reserve(v1.size());` that isn't performed by `PassByReferenceMultiply`. The functions are quite different. – user4581301 May 16 '19 at 16:19
  • Yes, creating the `vector ReferenceVector(10000);` outside of the time-measuring stuff is not really fair. – Ted Lyngmo May 16 '19 at 16:24
  • @TedLyngmo I see that now. I have changed my two functions so that they both have to reserve the space for the vector. I now see a small difference (~240 nanoseconds) between the two functions, and as expected my RVO function is faster than my pass_by_reference function. Thank you both, user4581301 and Ted Lyngmo. – Matthew Pittenger May 16 '19 at 16:36
  • @TedLyngmo Ted, it was your comment that really got it through to me. How can I mark yours as an answer? – Matthew Pittenger May 16 '19 at 16:41
  • @MatthewPittenger Note that the results from your original test are still valid though. **If** you repeatedly write to the **same** vector, it's better to reuse the existing vector than to reallocate it every time. RVO or not, overall your application will still be faster (assuming this is a hot path) if you avoid reallocating memory needlessly. – Max Langhof May 16 '19 at 16:41
  • You can't, and that's why people need to post these things _as answers_. – Lightness Races in Orbit May 16 '19 at 16:41
  • I was working on an answer using a real online benchmarking tool, but I'm not used to it so it took too long. :-) @MatthewPittenger You can perhaps make use of [quick-bench](http://quick-bench.com) for future benchmarking. – Ted Lyngmo May 16 '19 at 16:51
  • @MaxLanghof I understand you could avoid reallocating the memory and just use the existing vector and change the values. But if your function is returning a vector, I don't see a way around having to reserve space for the new vector. Would this just be a design thing where the developer has to know whether it is better to use a pass_by_reference function within a hot path? – Matthew Pittenger May 16 '19 at 16:56
  • Yes, on the hot path you might have to shape your design around such concerns. Repeatedly allocating large amounts of memory in sections that are provably performance-critical is most likely a bad idea. – Max Langhof May 16 '19 at 17:05
  • @MatthewPittenger, in performance-critical applications you might want that function to return a *template expression* for lazy evaluation (delayed until the ultimate destination is known). That may save you a few temporary objects. – Igor G May 16 '19 at 18:12
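
A minimal sketch of the fairer comparison the comments converge on, assuming the goal is to make both variants pay the allocation cost inside the timed region. The function names (`MultiplyReturning`, `MultiplyInto`) are illustrative, not from the original post:

```cpp
#include <cstddef>
#include <vector>

// Returns the result by value; RVO/NRVO or a move elides the copy,
// but a fresh allocation still happens on every call.
std::vector<double> MultiplyReturning(const std::vector<double>& a,
                                      const std::vector<double>& b)
{
    std::vector<double> result;
    result.reserve(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        result.emplace_back(a[i] * b[i]);
    return result;
}

// Fills a caller-provided vector. With assign() inside the function,
// any (re)allocation is now counted in the same timed region as above,
// making the benchmark an apples-to-apples comparison.
void MultiplyInto(const std::vector<double>& a,
                  const std::vector<double>& b,
                  std::vector<double>& out)
{
    out.assign(a.size(), 0.0);
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] * b[i];
}
```

Note that if the caller reuses the same `out` vector across iterations, `assign` can reuse its existing capacity, which is exactly the reuse advantage the original benchmark was (unintentionally) measuring.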

0 Answers
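The "template expression" idea mentioned in the last comment can be sketched as a toy lazy wrapper (all names here are illustrative, not a real library API): the multiply builds a lightweight view over the operands, and elements are computed only when assigned into a destination, so no temporary vector is allocated in between.

```cpp
#include <cstddef>
#include <vector>

// Toy expression template: holds references to the operands and
// computes each element on demand instead of materializing a temporary.
struct MulExpr {
    const std::vector<double>& lhs;
    const std::vector<double>& rhs;

    double operator[](std::size_t i) const { return lhs[i] * rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

// Building the expression is essentially free: nothing is computed yet.
MulExpr lazy_mul(const std::vector<double>& a, const std::vector<double>& b)
{
    return MulExpr{a, b};
}

// Evaluation happens only here, writing straight into the destination,
// which can reuse its existing storage on repeated calls.
void assign(std::vector<double>& dst, const MulExpr& e)
{
    dst.resize(e.size());
    for (std::size_t i = 0; i < e.size(); ++i)
        dst[i] = e[i];
}
```

Because `MulExpr` stores references, it must be consumed before the operand vectors go out of scope; real expression-template libraries (e.g. Eigen) take care to manage such lifetimes.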