
I've spent a little time trying to understand whether the performance impact of RVO is as valuable as I thought.

Here is my benchmark code (the main idea is to create big structures and return them from functions):

#include <chrono>
#include <iostream>
#include <string>
#include <stdlib.h>
#include <time.h>
#define SIZE_MEDIUM 128
#define SIZE_LARGE  8192
#define ITER_COUNT  100000
using namespace std;
using namespace std::chrono;

struct MediumStruct {
    int data[SIZE_MEDIUM];
};
struct LargeStruct {
    int data[SIZE_LARGE];
};

template<typename T>
T by_value(int size) {
    T rv;
    for (int i = 0; i < size; i++) {
        rv.data[i] = rand();
    }
    return rv;
}

template<typename T>
void by_ref(T &v, int size) {
    for (int i = 0; i < size; i++) {
        v.data[i] = rand();
    }
}

template<typename T>
void bench_by_value(int size, const string &suite) {
    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    int counter = 0;
    for (int i = 0; i < ITER_COUNT; i++) {
        T r = by_value<T>(size);
        for (int j = 0; j < size; j++) {
            counter += r.data[j];
        }
    }

    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(t2 - t1).count();
    cout << "By value (" << suite << ") " << duration << " ms " 
         << "[stub " << counter << "]" << endl;
}

template<typename T>
void bench_by_ref(int size, const string &suite) {
    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    int counter = 0;
    for (int i = 0; i < ITER_COUNT; i++) {
        T r;
        by_ref<T>(r, size);
        for (int j = 0; j < size; j++) {
            counter += r.data[j];
        }
    }

    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(t2 - t1).count();
    cout << "By ref (" << suite << ") " << duration << " ms " 
         << "[stub " << counter << "]" << endl;
}

int main() {
    srand(time(NULL));
    bench_by_value<MediumStruct>(SIZE_MEDIUM, "MEDIUM");
    bench_by_value<LargeStruct>(SIZE_LARGE, "LARGE");
    bench_by_ref<MediumStruct>(SIZE_MEDIUM, "MEDIUM");
    bench_by_ref<LargeStruct>(SIZE_LARGE, "LARGE");
}

On my MacBook Pro this benchmark shows pretty much the same performance with and without RVO.

$ g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.3.0 (clang-703.0.31)
Target: x86_64-apple-darwin15.2.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

$ g++ -std=c++11 -S -o /dev/stdout main.cpp | grep memcpy | wc -l
       0

$ g++ -std=c++11 -fno-elide-constructors -S -o /dev/stdout main.cpp | grep memcpy | wc -l
       4

$ g++ -std=c++11 main.cpp
$ ./a.out
By value (MEDIUM) 123 ms [stub -1593461165]
By value (LARGE) 7741 ms [stub -1299931242]
By ref (MEDIUM) 120 ms [stub -1955762550]
By ref (LARGE) 7835 ms [stub 911248215]
$ ./a.out
By value (MEDIUM) 118 ms [stub 2094615909]
By value (LARGE) 7840 ms [stub -1276864738]
By ref (MEDIUM) 118 ms [stub -223778153]
By ref (LARGE) 7890 ms [stub -381745773]

$ g++ -std=c++11 -fno-elide-constructors main.cpp
$ ./a.out
By value (MEDIUM) 122 ms [stub 1921645226]
By value (LARGE) 8078 ms [stub 1869896539]
By ref (MEDIUM) 123 ms [stub -676968691]
By ref (LARGE) 7855 ms [stub 1621698360]
$ ./a.out
By value (MEDIUM) 119 ms [stub 954834819]
By value (LARGE) 7779 ms [stub 98742842]
By ref (MEDIUM) 118 ms [stub 1498384025]
By ref (LARGE) 7505 ms [stub -118604845]

Given these results, one might be tempted to always return by value, since it looks better semantically, even if the compiler doesn't perform RVO. When does RVO actually give a significant performance benefit?

Link to the benchmark gist.

Ivan Velichko
  • `SIZE_LARGE` isn't particularly large, and neither of your structures under test has a user-defined copy or move constructor, so it doesn't surprise me that there is no huge benefit from RVO in your particular case. – CB Bailey Aug 19 '16 at 10:49
  • @CharlesBailey, OK, user-defined constructors could be heavy. Let's pretend for a moment that there are no such ctors. Is it common practice to allocate such big objects on the stack? Is 8 kB a big object? If not, what is a reasonable size for really big objects? – Ivan Velichko Aug 19 '16 at 11:09
  • @CharlesBailey, I've just tested with 64 kB structs and got 66.088s vs 65.156s. Still looks pretty much the same even though the size is 8 times bigger. – Ivan Velichko Aug 19 '16 at 11:11
  • I would look at multi-megabyte sized objects, although having said that, the calls to rand will be scaling linearly with object size and may well dominate the copy-on-return cost in any case. – CB Bailey Aug 19 '16 at 12:18
  • @CharlesBailey, nice point about `rand()`, thank you! Gonna replace `rand()` with `memset()` and try again. – Ivan Velichko Aug 19 '16 at 13:00
  • @CharlesBailey, after replacing `rand()` with `memset()`, the difference in execution time is about 20% for 64 kB structs: 16.048s vs 19.949s (a sketch of this change follows below). Nice advice, thanks again. Would you mind writing an answer which I could then accept? – Ivan Velichko Aug 19 '16 at 13:12
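
For reference, here is a minimal sketch of the change discussed in the comments above (my reconstruction, not the exact code from the gist): filling the struct with `memset()` instead of calling `rand()` per element removes the fill cost that scales with the object size, so the copy made on return is what remains to be measured.

#include <cstring>

// Hypothetical memset-based variant of by_value: the fill is now cheap,
// so the cost of copying the returned struct is no longer hidden by rand().
template<typename T>
T by_value_memset(int size) {
    T rv;
    memset(rv.data, 0x5a, size * sizeof(int));  // arbitrary fill pattern
    return rv;
}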

1 Answer

There are two things you should know about RVO:

1. RVO is not a (regular) compiler optimization

This sounds crazy, because the last O stands for "optimization", but let me explain. Regular compiler optimizations preserve the semantics of your code: at least the order of side effects and the dependencies between them. RVO does not bother with such things. Example:

#include <cstdio>
using namespace std;
struct foo {
  foo () { printf ("foo::foo()\n"); }
  foo (const foo&) { printf ("foo::foo( const foo& )\n"); }
  ~foo () { printf ("foo::~foo()\n"); }
};

foo bar() {
  foo local_foo;
  return local_foo;
}

int main() {
  foo f = bar();
  return 0;
}

What do you expect to see on the screen? Without RVO:

$ g++ -O2 rvo.cc -fno-elide-constructors
$ ./a.out

foo::foo()
foo::foo( const foo& )
foo::~foo()
foo::foo( const foo& )
foo::~foo()
foo::~foo()

With RVO:

$ g++ -O2 rvo.cc
$ ./a.out

foo::foo()
foo::~foo()

No compiler will ever optimize user code in such a way. But RVO is really more than an optimization: it is a prescription for the compiler to be blind in some cases. Blind to user-defined semantics, I mean.

2. RVO is extremely effective

Your benchmark just doesn't exploit this, because it doesn't do the things that RVO elides; it does things that any normal compiler can optimize anyway. But let's dance around return value copying:

#include <cstdio>    // for fprintf
#include <cstring>   // for memcpy

class MyBuf {
  int x_;
  char *buf_;
public:
  MyBuf(int x) : x_(x), buf_(new char[x_]) {}
  MyBuf(const MyBuf &rhs) :
    x_(rhs.x_), buf_(new char[x_])
  {
    memcpy (buf_, rhs.buf_, x_);
  }
  ~MyBuf() { delete [] buf_; }
  char &operator[](int x) { return buf_[x]; }
};

Now:

MyBuf
retvec()
{
  MyBuf v(10000);
  v[0] = 1;
  return v;
}

And wow:

int main() {
  int cnt = 0;
  for (int idx = 0; idx != 1000000; ++idx) {
    MyBuf v = retvec();
    cnt += v[0];
  }
  fprintf (stderr, "cnt is %d\n", cnt);
  return 0;
}

You will see something like:

$ g++ --std=c++11 -O2 rvo-spec.cc -fno-elide-constructors
$ time ./a.out 
cnt is 1000000

real    0m1.284s
user    0m1.280s
sys 0m0.000s

$ g++ --std=c++11 -O2 rvo-spec.cc
$ time ./a.out 
cnt is 1000000

real    0m0.099s
user    0m0.096s
sys 0m0.000s

More than a 12x difference for a very simple scenario.
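
To confirm that the gap really is the elided copy (a minimal sketch of my own, not part of the original answer), you can count copy-constructor calls: with elision the counter stays at zero, while with -fno-elide-constructors at least one copy runs per iteration, and in MyBuf's case every such copy pays for a new[], a memcpy and a delete[] of the 10000-byte buffer.

#include <cstdio>

// Sketch: a copy counter makes elision directly observable.
struct Counted {
  static long copies;
  Counted() {}
  Counted(const Counted &) { ++copies; }
};
long Counted::copies = 0;

Counted make() { Counted c; return c; }

int main() {
  for (int i = 0; i != 1000000; ++i) {
    Counted c = make();
    (void)c;    // keep the object; only the copies matter here
  }
  printf("copy constructions: %ld\n", Counted::copies);
  return 0;
}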

I hope this gives you some feeling for RVO and why you should never switch it off. Good luck with it!

Konstantin Vladimirov