
I am trying to understand how Eigen::Ref works to see if I can take advantage of it in my code.

I have designed a benchmark like this:

#include <benchmark/benchmark.h>
#include <chrono>

#include <Eigen/Dense>

static void value(benchmark::State &state) {
  for (auto _ : state) {
    const Eigen::Matrix<double, Eigen::Dynamic, 1> vs =
        Eigen::Matrix<double, 9, 1>::Random();
    auto start = std::chrono::high_resolution_clock::now();

    const Eigen::Vector3d v0 = vs.segment<3>(0);
    const Eigen::Vector3d v1 = vs.segment<3>(3);
    const Eigen::Vector3d v2 = vs.segment<3>(6);
    const Eigen::Vector3d vt = v0 + v1 + v2;
    const Eigen::Vector3d v = vt.transpose() * vt * vt + vt;

    benchmark::DoNotOptimize(v);
    auto end = std::chrono::high_resolution_clock::now();

    auto elapsed_seconds =
        std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
    state.SetIterationTime(elapsed_seconds.count());
  }
}
BENCHMARK(value)->UseManualTime(); // produces the "/manual_time" suffix below

I have two more tests like this one: one using const Eigen::Ref<const Eigen::Vector3d> and one using auto for v0, v1, v2 and vt.

The results of these benchmarks are:

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
value/manual_time               23.4 ns          113 ns     29974946
ref/manual_time                 23.0 ns          111 ns     29934053
with_auto/manual_time           23.6 ns          112 ns     29891056

As you can see, all the tests behave exactly the same. So I thought that maybe the compiler was doing its magic and decided to test with -O0. These are the results:

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
value/manual_time               2475 ns         3070 ns       291032
ref/manual_time                 2482 ns         3077 ns       289258
with_auto/manual_time           2436 ns         3012 ns       263170

Again, the three cases behave the same.

If I understand correctly, the first case, using Eigen::Vector3d, should be slower, as it has to make the copies, perform the v0 + v1 + v2 operation and store it, and then perform another operation and store the result.

The auto case should be the fastest, as it should skip all the intermediate writes.

I think the ref case should be as fast as the auto case. If I understand correctly, all my operations can be bound to a reference to a const Eigen::Vector3d, so the intermediate writes should be skipped, right?

Why are the results all the same? Am I misunderstanding something, or is the benchmark just badly designed?

jjcasmar
  • I'm curious why you're measuring time yourself instead of allowing the benchmark framework to measure it for you. – dma Jun 30 '22 at 08:31
  • I am creating a random vector because otherwise the compiler sees that the same operation is repeated every time with the same input data and just compiles everything away. Since I have to create the random vector inside the loop, I don't want to measure that time. – jjcasmar Jun 30 '22 at 10:18
  • then use the Pause and Resume functionality of the library. – dma Jul 01 '22 at 11:19
  • That is super slow, I already tested it – jjcasmar Jul 01 '22 at 18:01
  • Pause and Resume does not solve the problem because they need to get the current time, which is at least as expensive as `high_resolution_clock::now()`. AFAIK, the fastest way to get the current time on x86-64 machines is RDTSC/RDTSCP, which is too expensive in this case. In fact, it is not possible to measure the time better than that without understanding how the target processor works internally. Indeed, modern processors pipeline instructions and execute them in parallel in an order governed by the availability of input data and execution units. It takes time to measure the time correctly. – Jérôme Richard Jul 01 '22 at 18:47

2 Answers


One big issue with the benchmark is that you measure the time in the hot benchmarking loop. The thing is, measuring the time takes some time itself, and it can be far more expensive than the actual computation. In fact, I think this is what is happening in your case. Indeed, on Clang 13 with -O3, here is the assembly code actually benchmarked (available on GodBolt):

        mov     rbx, rax
        mov     rax, qword ptr [rsp + 24]
        cmp     rax, 2
        jle     .LBB0_17
        cmp     rax, 5
        jle     .LBB0_17
        cmp     rax, 8
        jle     .LBB0_17
        mov     rax, qword ptr [rsp + 16]
        movupd  xmm0, xmmword ptr [rax]
        movsd   xmm1, qword ptr [rax + 16]      # xmm1 = mem[0],zero
        movupd  xmm2, xmmword ptr [rax + 24]
        addpd   xmm2, xmm0
        movupd  xmm0, xmmword ptr [rax + 48]
        addsd   xmm1, qword ptr [rax + 40]
        addpd   xmm0, xmm2
        addsd   xmm1, qword ptr [rax + 64]
        movapd  xmm2, xmm0
        mulpd   xmm2, xmm0
        movapd  xmm3, xmm2
        unpckhpd        xmm3, xmm2                      # xmm3 = xmm3[1],xmm2[1]
        addsd   xmm3, xmm2
        movapd  xmm2, xmm1
        mulsd   xmm2, xmm1
        addsd   xmm2, xmm3
        movapd  xmm3, xmm1
        mulsd   xmm3, xmm2
        unpcklpd        xmm2, xmm2                      # xmm2 = xmm2[0,0]
        mulpd   xmm2, xmm0
        addpd   xmm2, xmm0
        movapd  xmmword ptr [rsp + 32], xmm2
        addsd   xmm3, xmm1
        movsd   qword ptr [rsp + 48], xmm3

This code can be executed in a few dozen cycles, so probably less than 10-15 ns on a modern 4-5 GHz x86 processor. Meanwhile, high_resolution_clock::now() should use an RDTSC/RDTSCP instruction, which also takes dozens of cycles to complete. For example, on a Skylake processor it should take about 25 cycles (similar on newer Intel processors). On an AMD Zen processor it takes about 35-38 cycles. Additionally, it adds a synchronization that may not be representative of the actual application. Please consider measuring the time of a benchmarking loop with many iterations instead.
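That advice, amortizing the clock overhead over many iterations, can be sketched without any Eigen at all. Here batch_average_ns is a hypothetical helper and the multiply-add stands in for the real kernel:

```cpp
#include <chrono>

// Time one batch of n iterations and divide: the two now() calls are paid
// once per batch instead of once per iteration.
double batch_average_ns(int n, double *result) {
    double acc = 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        acc += static_cast<double>(i) * 1.5;  // stand-in for the real computation
    }
    const auto end = std::chrono::steady_clock::now();
    *result = acc;  // publish the result so the loop cannot be discarded
    const double total_ns =
        std::chrono::duration<double, std::nano>(end - start).count();
    return total_ns / n;
}
```

With n in the millions, the 25-40 cycles per now() call disappear into the noise instead of dominating a 10-15 ns measurement.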

Jérôme Richard
  • So essentially the benchmark is badly designed, since the way I am measuring has a significant cost compared to the thing I am measuring. I will have to do a more complex operation. – jjcasmar Jun 30 '22 at 10:20
  • Yes. Alternatively, you can use an inner loop, or just not measure the time in the inner loop at all; AFAIK that should be the purpose of the benchmarking tool (certainly the best solution). – Jérôme Richard Jun 30 '22 at 10:46

Because everything happens inside a function, the compiler can do escape analysis and optimize away the copies into the vectors.

To check this, I put the code in a function to look at the assembler:

Eigen::Vector3d foo(const Eigen::VectorXd& vs)
{
    const Eigen::Vector3d v0 = vs.segment<3>(0);
    const Eigen::Vector3d v1 = vs.segment<3>(3);
    const Eigen::Vector3d v2 = vs.segment<3>(6);
    const Eigen::Vector3d vt = v0 + v1 + v2;
    return vt.transpose() * vt * vt + vt;
}

which turns into this assembler:

        push    rax
        mov     rax, qword ptr [rsi + 8]
...
        mov     rax, qword ptr [rsi]
        movupd  xmm0, xmmword ptr [rax]
        movsd   xmm1, qword ptr [rax + 16]
        movupd  xmm2, xmmword ptr [rax + 24]
        addpd   xmm2, xmm0
        movupd  xmm0, xmmword ptr [rax + 48]
        addsd   xmm1, qword ptr [rax + 40]
        addpd   xmm0, xmm2
        addsd   xmm1, qword ptr [rax + 64]
...
        movupd  xmmword ptr [rdi], xmm2
        addsd   xmm3, xmm1
        movsd   qword ptr [rdi + 16], xmm3
        mov     rax, rdi
        pop     rcx
        ret

Notice how the only memory operations are two general-purpose register loads to get the start pointer and length, then a couple of vector loads to get the vector's content into registers, before the result is written to memory at the end.

This only works because we are dealing with fixed-size vectors. With VectorXd, copies would definitely take place.

Alternative benchmarks

Ref is typically used on function calls. Why not try it with a function that cannot be inlined? Or come up with an example where escape analysis cannot work and the objects really have to be materialized. Something like this:

struct Foo
{
public:
    Eigen::Vector3d v0;
    Eigen::Vector3d v1;
    Eigen::Vector3d v2;
    
    Foo(const Eigen::VectorXd& vs) __attribute__((noinline));
    Eigen::Vector3d operator()() const __attribute__((noinline));
};

Foo::Foo(const Eigen::VectorXd& vs)
: v0(vs.segment<3>(0)),
  v1(vs.segment<3>(3)),
  v2(vs.segment<3>(6))
{}
Eigen::Vector3d Foo::operator()() const
{
    const Eigen::Vector3d vt = v0 + v1 + v2;
    return vt.transpose() * vt * vt + vt;
}
Eigen::Vector3d bar(const Eigen::VectorXd& vs)
{
    Foo f(vs);
    return f();
}

By splitting initialization and usage into non-inline functions, the copies really have to be made. Of course, we have now changed the entire use case; you have to decide whether this is still relevant to you.

Purpose of Ref

Ref exists for the sole purpose of providing a function interface that can accept both a full matrix/vector and a slice of one. Consider this:

Eigen::VectorXd foo(const Eigen::VectorXd&)

This interface can only accept a full vector as its input. If you want to call foo(vector.head(10)), you have to allocate a new vector to hold the vector segment. Likewise, it always returns a newly allocated vector, which is wasteful if you want to use it as output.head(10) = foo(input). So we can instead write

void foo(Eigen::Ref<Eigen::VectorXd> out, const Eigen::Ref<const Eigen::VectorXd>& in);

and use it as foo(output.head(10), input.head(10)) without any copies being created. This is only ever useful across compilation units: if one cpp file declares a function that is used in another, Ref allows this to happen. Within a single cpp file, you can simply use a template.

template<class Derived1, class Derived2>
void foo(const Eigen::MatrixBase<Derived1>& out,
         const Eigen::MatrixBase<Derived2>& in)
{
    Eigen::MatrixBase<Derived1>& mutable_out =
          const_cast<Eigen::MatrixBase<Derived1>&>(out);
    mutable_out = ...;
}

A template will always be faster because it can make use of the concrete data type. For example, if you pass an entire vector, Eigen knows that the array is properly aligned. And with a full matrix, it knows that there is no stride between columns. It doesn't know either of these with Ref. In this regard, Ref is just a fancy wrapper around Eigen::Map<Type, Eigen::Unaligned, Eigen::OuterStride<>>.

Likewise, there are cases where Ref has to create temporary copies. The most common case is when the inner stride is not 1. This happens, for example, if you pass a row of a matrix (but not a column; Eigen is column-major by default) or the real part of a complex-valued matrix. You will not even receive a warning for this; your code will simply run slower than expected.

The only reasons to use Ref inside a single cpp file are

  1. To make the code more readable. That template pattern shown above certainly doesn't tell you much about the expected types
  2. To reduce code size, which may have a performance benefit, but usually doesn't. It does help with compile time, though

Use with fixed-size types

Since your use case seems to involve fixed-sized vectors, let's consider this case in particular and look at the internals.

void foo(const Eigen::Vector3d&);
void bar(const Eigen::Ref<const Eigen::Vector3d>&);

int main()
{
    Eigen::VectorXd in = ...;
    foo(in.segment<3>(6));
    bar(in.segment<3>(6));
}

The following things will happen when you call foo:

  1. We copy 3 doubles from in[6] to the stack. This takes 4 instructions (2 movapd, 2 movsd).
  2. A pointer to these values is passed to foo. (Even fixed-size Eigen vectors declare a destructor, therefore they are always passed on the stack, even if we declare them pass-by-value)
  3. foo then loads the values via that pointer, taking 2 instructions (movapd + movsd)

The following happens when we call bar:

  1. We create a Ref<Vector> object. For this, we put a pointer to in.data() + 6 on the stack
  2. A pointer to this pointer is passed to bar
  3. bar loads the pointer from the stack, then loads the values

Notice how there is barely any difference. Maybe Ref saves a few instructions, but it also introduces one more indirection. Compared to everything else, this is hardly significant, and certainly too little to measure.

We are also entering the realm of micro-optimizations here. This can lead to situations where just the arrangement of code results in measurably different optimizations, as in Eigen: Why is Map slower than Vector3d for this template expression?

Homer512
  • I tried with functions receiving Eigen::Vector3d, Ref and also with templates. Exactly the same results. The point of using Ref, as I understand it, is that it can work as a replacement for a template or `auto`, since it will only perform an operation if completely needed. For now I am just trying to understand how I can take advantage of this to decide if it's worth changing hundreds of lines of production code – jjcasmar Jun 30 '22 at 10:23
  • @jjcasmar I think you misunderstand what Ref does and doesn't do. I expanded my answer to cover this – Homer512 Jun 30 '22 at 12:03
  • Thanks @Homer512. I think I understand a bit better now, although I will need much more time to fully understand. Could you do the same analysis with a template, please? I would assume the template version will just receive a VectorBlock and should be the fastest, no? – jjcasmar Jun 30 '22 at 12:27
  • @jjcasmar Looking at the assembler, the template may actually put some more data on the stack but overall it behaves roughly like Ref. However, there is a very high chance that the template will be inlined and not cause any overhead at all. – Homer512 Jun 30 '22 at 13:14
  • So I guess I will have to check with my real calculations, which are more complex than this, to actually verify if stuff is being inlined and how it performs. Thanks! – jjcasmar Jun 30 '22 at 13:19