How to improve GEMM performance on data-mapped (Eigen::Map) matrices sharing memory with an std::vector?

Question

When multiplying two data-mapped matrices (Eigen::Map), I notice a significant performance difference depending on how the memory was allocated. With memory from a custom allocation, the multiplication is almost twice as fast as with (also aligned) memory from an std::vector whose data is likewise allocated by Eigen::aligned_allocator.

Minimal benchmark:

#include <Eigen/Core>
#include <Eigen/StdVector>

#include <chrono>
#include <iostream>
#include <string>
#include <vector>

using Matrix = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;
using Mapped = Eigen::Map<Matrix, Eigen::Aligned16>;
using aligned_vector = std::vector<float, Eigen::aligned_allocator<float>>;

void measure(const std::string& name, const Mapped& a, const Mapped& b, Mapped& c)
{
    using namespace std::chrono;
    // steady_clock is monotonic, which makes it the safer choice for measuring intervals.
    const auto start_time = steady_clock::now();
    const std::size_t runs = 10;
    for (std::size_t i = 0; i < runs; ++i)
    {
        c.noalias() = a * b;
    }
    const auto end_time = steady_clock::now();
    // duration_cast avoids assuming that the clock ticks in nanoseconds.
    const auto elapsed_ms = duration_cast<milliseconds>(end_time - start_time).count();
    std::cout << name << ": " << elapsed_ms << " ms" << std::endl;
}

int main()
{
    unsigned int size_1 = 1;
    unsigned int size_2 = 8192;
    unsigned int size_3 = 16384;

    aligned_vector a_vec(size_1 * size_2);
    aligned_vector b_vec(size_2 * size_3);
    aligned_vector c_vec(size_1 * size_3);
    Mapped a_mapped_vec(a_vec.data(), size_1, size_2);
    Mapped b_mapped_vec(b_vec.data(), size_2, size_3);
    Mapped c_mapped_vec(c_vec.data(), size_1, size_3);
    measure("Mapped vector memory", a_mapped_vec, b_mapped_vec, c_mapped_vec);

    Eigen::aligned_allocator<float> allocator;
    float* a_mem = allocator.allocate(size_1 * size_2);
    float* b_mem = allocator.allocate(size_2 * size_3);
    float* c_mem = allocator.allocate(size_1 * size_3);
    Mapped a_mapped_mem(a_mem, size_1, size_2);
    Mapped b_mapped_mem(b_mem, size_2, size_3);
    Mapped c_mapped_mem(c_mem, size_1, size_3);
    measure("Mapped custom memory", a_mapped_mem, b_mapped_mem, c_mapped_mem);
    allocator.deallocate(a_mem, size_1 * size_2);
    allocator.deallocate(b_mem, size_2 * size_3);
    allocator.deallocate(c_mem, size_1 * size_3);
}

Output on my machine (Core i5-6600):

Mapped vector memory: 661 ms
Mapped custom memory: 370 ms

Dockerfile to quickly reproduce the effect:

FROM ubuntu:20.04

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update
RUN apt-get install -y build-essential cmake git wget

RUN git clone -b '3.3.7' --single-branch --depth 1 https://github.com/eigenteam/eigen-git-mirror && cd eigen-git-mirror && mkdir -p build && cd build && cmake .. && make && make install && ln -s /usr/local/include/eigen3/Eigen /usr/local/include/Eigen

RUN wget https://gist.githubusercontent.com/Dobiasd/4b80aa0d5d19f8112656794ab94a061b/raw/c9cca8abc16ab35e71070aed5e779c7a8ebb3a7e/main.cpp
RUN g++ -std=c++14 -O3 -march=native main.cpp -o main

ADD "https://www.random.org/cgi-bin/randbyte?nbytes=10&format=h" skipcache
RUN ./main

Why is there such a difference? (I'd think Eigen would not know where the memory comes from.)

And even more important for me, how can I improve the performance on the memory coming from the std::vector?

I edited this to clarify the problem statement... can you check that the edit reflects what you meant? This is a very interesting, well-stated question. — NicholasM, May 10 '20 at 06:23
@NicholasM Yes, your edit reflects what I meant. Thanks for fixing it. — Tobias Hermann, May 10 '20 at 06:56
aligned_allocator presumably does not initialize the memory. std::vector always initializes its memory. — PeterT, May 10 '20 at 16:43
`allocator.allocate` returns uninitialized memory, and reading from uninitialized memory is UB -- probably the MMU decides to just give zero values, without ever having to access actual memory. You should at least initialize both input types to get a better comparison. — chtz, May 10 '20 at 16:43
Thanks a lot, PeterT and chtz. You're absolutely correct, this is the reason. — Tobias Hermann, May 10 '20 at 17:01
One small note, according to Eigen’s [documentation of its enumerations](https://eigen.tuxfamily.org/dox/group__enums.html#ga45fe06e29902b7a2773de05ba27b47a1), the constant `Eigen::Aligned` is a deprecated synonym of `Aligned16`, which is probably understating the actual alignment. — NicholasM, May 10 '20 at 18:01
Thanks. I've edited my question accordingly to avoid spreading bad practices to future readers. :) — Tobias Hermann, May 10 '20 at 18:46

Tobias Hermann · Accepted Answer · 2020-05-10T19:17:41.563

As PeterT and chtz pointed out in the comments, the manually allocated version does not initialize its memory (in contrast to std::vector). Reading it is undefined behavior, and the MMU likely does something smart, i.e., it never actually accesses the memory.

When also initializing the memory in the second part, both versions show similar performance:

    // std::memset requires #include <cstring>.
    float* a_mem = allocator.allocate(size_1 * size_2);
    std::memset(a_mem, 0, size_1 * size_2 * sizeof(float));
    float* b_mem = allocator.allocate(size_2 * size_3);
    std::memset(b_mem, 0, size_2 * size_3 * sizeof(float));
    float* c_mem = allocator.allocate(size_1 * size_3);
    std::memset(c_mem, 0, size_1 * size_3 * sizeof(float));

Mapped vector memory: 654 ms
Mapped custom memory: 655 ms
