5
#include <chrono>
#include <ctime>
#include <iostream>

#include <Eigen/Dense>

using namespace std;
using namespace Eigen;

// Accumulates the outer product of a complex column vector `a` (20x1) and a
// real row vector `b` (1x2) into `out` (20x2) as a single Eigen product
// expression. noalias() tells Eigen that `out` does not overlap `a` or `b`,
// so the result can be written directly without a temporary.
//
// NOTE(review): this is the variant measured as "slow" in the benchmark
// below. Per the discussion in this thread, Eigen 3.3.x appears to generate
// poorly vectorized code (unnecessary shuffles) for this mixed
// complex-vector-times-real-vector outer product — TODO confirm against the
// generated assembly / a newer Eigen release.
void slow(const Matrix<std::complex<float>, 20, 1> &a, const Matrix<float, 1, 2> &b, Matrix<std::complex<float>, 20, 2> &out)
{
    out.noalias() += a * b;
}

void fast(const Matrix<std::complex<float>, 20, 1> &a, const Matrix<float, 1, 2> &b, Matrix<std::complex<float>, 20, 2> &out)
{
    for (size_t i = 0; i < 2; ++i)
    {
        out.col(i).noalias() += a * b[i];
    }
}

// Benchmark driver: times N repeated accumulations of the outer product
// a * b into `out`, once via slow() (single product expression) and once
// via fast() (column-by-column), printing the result norm and elapsed time
// for each. Returns 0.
int main(int, const char**)
{
    Matrix<std::complex<float>, 20, 2> out;
    Matrix<std::complex<float>, 20, 1> a;
    Matrix<float, 1, 2> b;
    a.setRandom();
    b.setRandom();
    out.setZero();

    const size_t N = 10000000;

    // steady_clock is a monotonic wall clock; clock() measures CPU time on
    // POSIX systems, which makes cross-platform benchmark numbers misleading.
    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < N; ++i)
    {
        slow(a, b, out);
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    // Printing the norm forces evaluation of the accumulated result so the
    // compiler cannot eliminate the benchmark loop as dead code.
    cout << "Matrix norm: " << out.norm() << endl;
    cout << "Slow: " << std::chrono::duration<double, std::milli>(elapsed).count() << " ms" << endl;

    out.setZero();
    start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < N; ++i)
    {
        fast(a, b, out);
    }
    elapsed = std::chrono::steady_clock::now() - start;
    cout << "Matrix norm: " << out.norm() << endl;
    cout << "Fast: " << std::chrono::duration<double, std::milli>(elapsed).count() << " ms" << endl;

    return 0;
}

I have compiled the above code using Visual Studio 2017, Eigen 3.3.7 and with the following compiler flags:

/O2 /fp:fast /arch:AVX2

Here is the output of the program:

Matrix norm: 3.07615e+07
Slow: 6707 ms
Matrix norm: 3.07615e+07
Fast: 230 ms

The function "fast" and "slow" compute the same result. Why is the "slow" function slower than the "fast" function?

tboerstad
  • 51
  • 4
  • Before I even test this, did you try a more recent version of Eigen? The latest stable is 3.3.7 – chtz Dec 17 '18 at 14:13
  • I just tested with Eigen 3.3.7. I get the same results – tboerstad Dec 17 '18 at 14:41
  • 2
    Possible duplicate of [Eigen3 matrix multiplication performance](https://stackoverflow.com/questions/31028636/eigen3-matrix-multiplication-performance) – Daniel Langr Dec 17 '18 at 15:11
  • 1
  • In other words, if you measure the runtime of anything that contains lazy evaluation, you need to also measure some operation that ensures the computation has been completed (such as printing the matrix or its norm). Otherwise, your measurements do not make sense. – Daniel Langr Dec 17 '18 at 15:15
  • 2
    Briefly looking at the generated assembly, there seems to be a problem with complex-real outer products, namely there are unnecessary shuffles while it still does not get properly vectorized. Need to investigate further for this. – chtz Dec 17 '18 at 15:16
  • Thank you @DanielLangr. I now print the norm of the output variable. I still get a significant performance difference. – tboerstad Dec 17 '18 at 15:29
  • Your performance test is repeated `N` times. But only your last result is printed. So you have `N-1` lazy calculation and 1 evaluation. – Thomas Sablik Dec 17 '18 at 15:29
  • @ThomasSablik The output variable is reused/accumulated, the output is dependent on every iteration of the loop – tboerstad Dec 17 '18 at 15:31
  • @tboerstad And what are the runtimes if you completely remove lazy evaluations? Just for comparison. – Daniel Langr Dec 17 '18 at 15:31
  • 1
    When testing with "g++ -O2 -ffast-math", I got 569 ms with slow and 578 ms with fast ... – Damien Dec 17 '18 at 15:48
  • 2
    @chtz I changed the type of "b" from float to complex, which should mean more computation needs to happen. Then the absolute runtime of both functions improved, and the two functions are more or less on par with respect to speed – tboerstad Dec 17 '18 at 15:50

0 Answers