Eigen methods in C++ library (OpenCV, Eigen, Armadillo) considerably slower than MATLAB

Question

I am trying to compute Eigen vectors for a matrix in C++ in a very efficient way. The problem is that the most representative C++ libraries OpenCV, Eigen and Armadillo are considerably slower than MATLAB equivalent eigen function. This is devastating and very concerning for me, I would never expect that MATLAB could beat C++ code performance, specially if it is a very well known/used library like the aforementioned ones. To give an idea of the performance difference, below find all the computation times for eigen vector computation in different libraries (units in seconds, some of them are in hours and days! yes days!) (all of them give me the expect results including MATLAB, it just the execution time is vastly difference)

The unfortunate part is that for my purposes I need to compute the Eigen values for a matrix whose dimensions are (8192, 8192). From the table you can see that MATLAB only takes 17 secs, and the second best (armadillo) takes 47 secs, might not sound like a big deal, but I need to repeat this operation thousands of times, so you can imagine how this will add up time quickly. I would tremendously appreciate if someone can tell me what I am doing wrong or this is just the sad reality I have to face and just have slow C++ code (at least slower than MATLAB) (by the way, props to the MATLAB coders that manage to have the eig function considerably faster than any other C++ library). For those who are interested in looking at the code and way how I compute the eigen vectors, I will leave it down here.

PD: All these methods were tested using release builds for the libraries (including libopenblas.dll for armadillo, excepting eigen that is a header library)

PD: All these computation times were obtained using same computer

// COV is a covariance matrix, meaning is a symmetric matrix
// has just a bunch of double numbers (CV_64FC1).
// For those interesting this covariance matrix COV is just
// the result of this operation COV = X' * X, where M is just literally 
// any double (CV_64FC1) matrix dimensions (m, 8192), m can be any positive 
// value to be honest.

cv::Mat X = read_opencv_matrix()   // dim(X) = (8192, m)

cv::Mat COV, MEAN; 

// dim(COV) = (8192, 8192)
cv::calcCovarMatrix(X.t(), COV, MEAN, cv::COVAR_NORMAL | cv::COVAR_ROWS);

int numRows = X.rows;       // should be 8192
int numCols = X.cols;       // can be anything to be honest


// computing eigen values using different libraries (opencv, armadillo, eigen)
// all of them give me the same results (including MATLAB function eig())
// the problem is that all of them are considerably slower than MATLAB

///////////////////////////////// OPENCV //////////////////////////
// opencv + eigen
if (do_opencv_eigen)
{
    cv::Mat D, V;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    cv::eigen(cov, D, V);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
    std::cout << "[TIME] OpenCV + cv::eigen = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}

// opencv + SVD
if (do_opencv_eigen)
{
    cv::Mat u, w, vt;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    cv::SVD::compute(cov, w, u, vt);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
    std::cout << "[TIME] OpenCV + cv::SVD = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}

// opencv + SVD + MODIFY_A flag
if (do_opencv_svd_mod_a)
{
    cv::Mat u2, w2, vt2;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    cv::SVD::compute(cov, w2, u2, vt2, cv::SVD::MODIFY_A);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
    std::cout << "[TIME] OpenCV + cv::SVD::MODIFY_A = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}


///////////////////////// ARMADILLO /////////////////////////////
arma::mat arma_cov = Utils::opencv2arma(cov);       // helper function to convert cv::mat to arma::mat
arma::mat arma_X = Utils::opencv2arma(X);
arma::mat arma_col_mean_rep = Utils::opencv2arma(col_mean_rep);

// compute arma eigen gen vectors
if (do_arma_eigen_gen)
{
    arma::cx_vec arma_Dc;
    arma::cx_mat arma_Vc;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    arma::eig_gen(arma_Dc, arma_Vc, arma_cov);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
    std::cout << "[TIME] Arma + arma::eig_gen = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}

// compute arma eigen gen vectors
if (do_arma_eigen_sym)
{
    arma::vec arma_D;
    arma::mat arma_V;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    arma::eig_sym(arma_D, arma_V, arma_cov);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
    std::cout << "[TIME] Arma + arma::eig_sym = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}

// armadillo + svd
if (do_arma_svd)
{
    arma::mat arma_U2;
    arma::vec arma_s2;
    arma::mat arma_V2;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    arma::svd(arma_U2, arma_s2, arma_V2, arma_cov);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
    std::cout << "[TIME] Arma + arma::svd = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}

// armadillo + svd + econ
if (do_arma_svd_econ)
{
    arma::mat arma_U2_econ;
    arma::vec arma_s2_econ;
    arma::mat arma_V2_econ;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    arma::svd_econ(arma_U2_econ, arma_s2_econ, arma_V2_econ, arma_cov);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
    std::cout << "[TIME] Arma + arma::svd_econ = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}

/////////////////// EIGEN /////////////////////////////

Eigen::Matrix eig_cov = Utils::opencv2eigen(cov);       // helper function to convert cv::mat to Eigen::Matrix
Eigen::Matrix eig_X = Utils::opencv2eigen(X);
Eigen::Matrix eige_col_mean_rep = Utils::opencv2eigen(col_mean_rep);

//Eigen general eigen function
if (do_eigen_eigen)
{
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    Eigen::EigenSolver<Eigen::MatrixXd> es(eig_cov);
    Eigen::MatrixXcd eig_VC = es.eigenvectors();
    Eigen::MatrixXd eig_V = eig_VC.real();
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
    std::cout << "[TIME] Eigen + Eigen::EigenSolver = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}

// eigen library + SVD
if (do_eigen_SVD)
{
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    Eigen::BDCSVD<Eigen::MatrixXd> SVD(eig_cov, Eigen::ComputeThinU | Eigen::ComputeThinV);
    Eigen::MatrixXd eig_V2 = SVD.matrixV();
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
    std::cout << "[TIME] Eigen + Eigen::BDCSVD = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}

//Eigen library + SelfAdjointEigenSolver
if (do_eigen_SelfAdjoin)
{
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> esa;
    esa.compute(eig_cov);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
    std::cout << "[TIME] Eigen + Eigen::SelfAdjointEigenSolver = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}

UPDATE 1 OS : Windows 10 IDE : Visual Studio 2022 CPU : Intel Xeon CPU E5 2687W v4 OpenCV Version: 4.5.3 (compiled with the official intrusction from opencv https://docs.opencv.org/4.x/d3/d52/tutorial_windows_install.html) Armadillo: uses the default libopenblas.dll that comes with it Eigen: I downloaded the lastest and greatest version up today

Which compiler on which OS did you use, which compiler flags, which library versions? — Sedenion, Feb 12 '23 at 08:27
Your praises shouldn't go to Matlab but to Intel MKL which is used by Matlab. When you use Eigen with the [MKL backend](https://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html) you get the same performance. Make sure to compile with one of the parallelized implementations. If you see only one CPU core used for the majority, fix your linker and compile flags — Homer512, Feb 12 '23 at 10:03
If you run an AMD CPU, you might instead want to try out AMD's [AOCL](https://www.amd.com/en/developer/aocl.html) library. In particular AOCL-BLIS as a BLAS backend and AOCL-libFLAME as LAPACK. See [here](https://eigen.tuxfamily.org/dox/TopicUsingBlasLapack.html) how to use BLAS and LAPACK backends with Eigen — Homer512, Feb 12 '23 at 10:09
the difference isn't one of languages/compilers/multithreading but of **numerical algorithms** (if you don't know runtime complexity, big-O, you don't know what's going on here). MATLAB is used in academia, so they are aware of the state of the art in numerical algorithms. Intel MKL is similarly developed by smart cookies, or at least people who know who to ask. — Christoph Rackwitz, Feb 12 '23 at 12:37
“Why is this expensive tool faster than these tools I can download for free?” — Cris Luengo, Feb 12 '23 at 15:26
@CrisLuengo that's missing the point entirely. MKL is free to use but not open-source — Homer512, Feb 12 '23 at 17:26
it's the option with the most commercial/academic backing. that is the point. community efforts are a lot less likely to operate at the bleeding edge of research (numerical algorithms that aren't a couple decades old). I too would like to know what algorithms MKL uses, compared to Eigen and Armadillo, but that's just passing curiosity. to truly appreciate those details, one needs a graduate-level understanding of this field of math, which I neither have nor would have the inclination to get. — Christoph Rackwitz, Feb 12 '23 at 17:32
@Homer512 Your comment suggests you don’t know what MATLAB’s `eig` does. It is not just about calling a function in an highly optimized library, it is about knowing which algorithm to use for any given input matrix. MATLAB does that for you, Eigen and Armadillo have you pick the algorithm yourself. — Cris Luengo, Feb 12 '23 at 19:09
@CrisLuengo In what way do you have to pick the algorithm in Eigen? Just link against MKL as indicated in the documentation and invoke the exact same routine you would always do. — Homer512, Feb 12 '23 at 21:36
@Homer512 there are many different ways to compute the eigendecomposition, and different methods are much more efficient on specific matrices than others, and some are suitable only for some types of matrices. Picking the most efficient algorithm in each case is not trivial. — Cris Luengo, Feb 12 '23 at 22:39
@Homer512 could you please point me to how to compile with the parallized implementations? I am using this CPU Intel Xeon CPU E5 2687W v4. Do you think it is possible to use Armadillo with Intel MKL? — zeellos, Feb 13 '23 at 02:56
@Homer512 could you elaborate more in this comment? "Make sure to compile with one of the parallelized implementations. If you see only one CPU core used for the majority, fix your linker and compile flags" Downloaded latest version of Intel MKL and latest version of Eigen (3.4.0) I am using visual studio 2022 with C++17 I added the line #define EIGEN_USE_BLAS and also including #include but I get the following error Intel MKL ERROR: Parameter 6 was incorrect on entry to SGEMV (I used the link advisor to link the corresponding libraries) — zeellos, Feb 13 '23 at 06:45

Homer512 · Answer 1 · 2023-02-13T09:43:17.983

2

Turning my comments into an answer now that a more complete understanding of the platform is available:

Your praises shouldn't go to Matlab but to Intel MKL which is used by Matlab. When you use Eigen with the MKL backend you get the same performance, at least for straightforward LAPACK calls. Make sure to compile with one of the parallelized implementations. If you see only one CPU core used for the majority, fix your linker and compile flags.

In particular for Visual Studio 2022, and, I assume OneMKL 2023, the compile flags taken from MKL's Link Line Advisor and Eigen's documentation look like this.

For compilation: -I"%MKLROOT%\include" -DEIGEN_USE_MKL_ALL

Link line: mkl_intel_lp64_dll.lib mkl_tbb_thread_dll.lib mkl_core_dll.lib tbb12.lib

Alternatively you may use this line for OpenMP parallelization instead of TBB: mkl_intel_lp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib

No further changes such as including <Eigen/src/Core/util/MKL_support.h> should be necessary.

Compile errors

I get the following error Intel MKL ERROR: Parameter 6 was incorrect on entry to SGEMV (I used the link advisor to link the corresponding libraries)

I believe that error (or similarly worded ones with other function names) occur when you link with the ILP64 interface of MKL instead of the LP64 interface. The wording in the Eigen documentation could be clearer in this regard. On the link advisor, use "Select interface layer: C API with 32 bit integer"

Performance comparisons

Whether performance is exactly the same depends on whatever else Matlab decides to do. To the best of my knowledge its exact choice of decomposition is not documented. Eigen simply calls the syev family of functions, as far as I can tell. On my system (Linux with Intel i7-11800H), this computes the selfadjoint 2048x2048 decomposition in 0.6 seconds. So it is at least in the same ballpark and likely the exact same thing.

Alternatives

MKL is known to optimize primarily for Intel and may artificially cripple AMD. Something that has affected Matlab as well. On AMD CPUs you may use the plain old LAPACKE and BLAS interface provided by AMD's AOCL library. In particular AOCL-BLIS as a BLAS backend and AOCL-libFLAME as LAPACK.

Do you think it is possible to use Armadillo with Intel MKL?

According to its FAQ, yes. Link against MKL as its LAPACK library.

edited Feb 13 '23 at 09:43

answered Feb 13 '23 at 09:37

Homer512

9,144
2
8
25

first of all, I truly appreciate comments like yours with tons of information learnings. I followed your instructions on how to enable MKL in visual studio I was able to run both way of linking different libraries using TBB and OpenMP. For a matrix of 8912x8912, using Eigen::SelfAdjointEigenSolver, with TBB, I was able to reduce from 11 hours to 50s (unbelievable!) , using OpenMP all the way down to 40s. However, they are still far from MATLAB (17s), still somewhere x2-3 times slower for my purposes (unfortunately this times make my code unshippeable) (1/2) – zeellos Feb 16 '23 at 02:55
what it is fascinating is that with Armadilllo and libopenblas.dll I get pretty much the same time (40ish seconds) for a matrix of 8192x8192, is there any further tweaks that you may know or any other library that can beat MATLAB eig function? (2/2) – zeellos Feb 16 '23 at 02:56
@zeellos What shape does your original matrix `X` have? – Homer512 Feb 16 '23 at 07:09
X = 19x8192 (double or in OpenCV is CV_64FC1) then I compute the covariance matrix by doing COV = (1/18) * X.t() * X , this gives me a dimension for COV of 8192x8192, and I apply eig of this covariance matrix (MATLAB computes eigen vectors in 17s and the best I could do in c++ is ~40s with the combination suggested Eigen library + MKL ) – zeellos Feb 16 '23 at 10:53
@zeellos I was asking about the `X` size because of [the answer here](https://ch.mathworks.com/matlabcentral/answers/262276-how-can-i-increase-the-calculation-speed-in-using-eig). I haven't explored this yet (and my testing would be limited to C++, no Matlab) but I wonder if the performance is data-dependent (closer results if `X` is 8192x8192) or whether you can speed up both the C++ and Matlab version with an SVD. – Homer512 Feb 16 '23 at 18:25
Thanks for the reference @Homer512 that is a great source, and kind of similar situation to what I have. Unfortunately I have already tried OpenCV + SVD (including SVD with Econ) (see table above), I didn't have luck, but maybe i am missing something in the runs, you can even see my code in my original question, whenever you have a time to try it I'd appreciate it if you have better luck with getting the eigen vectors of a covariance matrix (8192x8192) , In my case I need the first 100 eigen vectors (even though my original data X is (8192,19) ) – zeellos Feb 18 '23 at 02:37
@Homer512 I am trying to do a similar thing albeit using MKL 2022, everything compiles fine but I see no speed up compared to Eigen built without MKL. Any other gotchas? [my question] (https://stackoverflow.com/q/75857475/4655624) – Ben238 Mar 27 '23 at 17:11

Eigen methods in C++ library (OpenCV, Eigen, Armadillo) considerably slower than MATLAB

1 Answers1

Compile errors

Performance comparisons

Alternatives

Linked