
I wrote a C++ program whose bottleneck is the diagonalization of a possibly large symmetric matrix. The program uses OpenMP and the CBLAS and LAPACKE C interfaces.

However, the call to `dsyev` runs on a single thread, both on my local machine and on an HPC cluster (as seen in htop or equivalent tools). It takes about 800 seconds to diagonalize a 12000x12000 matrix, while NumPy's `eigh` takes about 250 seconds. In both cases, `$OMP_NUM_THREADS` is set to the number of available threads.

Here is an example of C++ code that calls LAPACK and is essentially what my program does (the matrix is read from a binary file):

#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <omp.h>
#include <cblas.h>
extern "C"
{
    #include <lapacke.h>
}

using namespace std;

const char uplo = 'L';
const char jobz = 'V';


int eigen(double* M, double* W, int N)
{

    printf("Diagonalizing...\n");

    int info;

    // On exit, the eigenvalues are in W and the eigenvectors overwrite M
    info = LAPACKE_dsyev(LAPACK_COL_MAJOR, jobz, uplo, N, M, N, W);

    if (info < 0)
    {
        printf("FATAL ERROR: Eigensolver (dsyev) failed: illegal argument %d\n",-info);
    }
    else if (info > 0)
    {
        printf("FATAL ERROR: Eigensolver (dsyev) failed: failed to converge %d\n",info);
    }

    return info;
}


int main(int argc, char* argv[])
{

    #ifdef _OPENMP
    if (getenv("OMP_NUM_THREADS"))
    {
        omp_set_num_threads(atoi(getenv("OMP_NUM_THREADS")));
        printf("Running on %d threads\n", omp_get_max_threads());
    }
    #endif

    const int N = 12000;
    double t1 = 0.;
    double t2 = 0.;
    double* M = new double[N*N]{0.};
    double* W = new double[N]{0.};
    int info = 0;


    printf("Reading matrix.dat ...\n");
    ifstream file("matrix.dat", ios::binary);
    file.seekg(0, ios_base::end);
    size_t size = file.tellg();
    file.seekg(0, ios_base::beg);
    file.read(reinterpret_cast<char*>(M), size); // expects N*N doubles in column-major order
    printf("Done.\n");




    t1 = omp_get_wtime();

    info = eigen(M,W,N);
    printf("Exit code: %i\n",info);

    t2 = omp_get_wtime();
    printf("Done in (s): %-10.3f\n",t2-t1);

    delete [] M;
    delete [] W;

    return 0;

}

NumPy is evidently linked against a multi-threaded LAPACK implementation, while my program is not. I thought that any LAPACK implementation was multi-threaded in the sense that it calls BLAS, which is (supposedly) always multi-threaded.

On my local machine, `np.show_config()` gives the following:

blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/miniconda3/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/miniconda3/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/miniconda3/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/miniconda3/include']

My C++ code, on the other hand, is linked against the LAPACK and BLAS libraries located in /usr/local/opt/lapack/lib and /usr/local/opt/openblas/lib etc...
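As a side note, one way to check what the linked BLAS actually does, besides running `ldd` on the executable, is to query it at runtime. The sketch below assumes the program is linked against OpenBLAS, whose `cblas.h` declares the extensions `openblas_get_config()` and `openblas_get_num_threads()` (these are OpenBLAS-specific, not part of the standard CBLAS interface):

#include <cstdio>
#include <cblas.h> // OpenBLAS's cblas.h declares the openblas_* extensions

int main()
{
    // Prints the OpenBLAS version and build options (e.g. whether it was
    // built with OpenMP support) and the thread count it will actually use.
    printf("Config:  %s\n", openblas_get_config());
    printf("Threads: %d\n", openblas_get_num_threads());
    return 0;
}

If the reported thread count is 1, the library itself is running sequentially regardless of `$OMP_NUM_THREADS` (note that OpenBLAS builds without OpenMP support follow `$OPENBLAS_NUM_THREADS` instead).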

How can I fix this problem, i.e. how can I make sure that LAPACK uses all available threads?

  • I think you should first check that your C code and the NumPy one use the *same* LAPACKE library. One could use an optimized parallel implementation while the other does not. This is likely the case with an Intel-optimized Python or Anaconda. On Linux, you can check that using ldd, or dynamically by reading the files `/proc/self/maps` or `/proc/self/map_files` from Python. – Jérôme Richard Apr 19 '21 at 15:54
  • I don't think I'm using the same `LAPACKE`, as I installed it manually, while `NumPy` uses its own, I think? I could not do what you suggested, but I added the results of `np.show_config()` to the question. – Toool Apr 19 '21 at 16:13

1 Answer


From the provided information, it seems your C++ code is linked against OpenBLAS, while your Python implementation uses the Intel MKL.

OpenBLAS is a free open-source library that implements the basic linear algebra functions (called BLAS: matrix multiplication, dot products, etc.), but it barely supports the advanced linear algebra functions (called LAPACK: eigenvalues, QR decomposition, etc.). Consequently, while the BLAS functions of OpenBLAS are well optimized and run in parallel, its LAPACK functions are not well optimized yet and mostly run sequentially.

The Intel MKL is a non-free closed-source library implementing both the BLAS and LAPACK functions. Intel claims high performance for both (at least on Intel processors). The implementations are well optimized and most of them run in parallel.
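For completeness, the MKL also exposes its own runtime threading controls on top of `$OMP_NUM_THREADS`. A minimal sketch, assuming the single `mkl.h` header and the service functions `mkl_set_num_threads()` / `mkl_get_max_threads()`:

#include <cstdio>
#include <mkl.h> // single MKL header: declares BLAS, LAPACKE and service functions

int main()
{
    // Request 8 threads for subsequent BLAS/LAPACK calls
    // (the MKL treats this as a hint, not a hard guarantee).
    mkl_set_num_threads(8);
    printf("MKL max threads: %d\n", mkl_get_max_threads());
    return 0;
}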

As a result, if you want your C++ code to be at least as fast as your Python code, you need to link against the MKL rather than OpenBLAS.
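For illustration, typical link lines on Linux might look as follows. The MKL line corresponds to a GCC/LP64/GNU-OpenMP configuration; the paths and library names are assumptions, and the exact flags for your setup should be taken from Intel's link-line advisor:

g++ -O2 -fopenmp prog.cpp -lopenblas -llapacke
g++ -O2 -fopenmp -m64 -I${MKLROOT}/include prog.cpp \
    -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed \
    -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl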

  • Indeed, I linked the C++ program to OpenBLAS. I wasn't aware of such a huge performance difference between BLAS implementations, especially concerning multi-threading! Could you clarify whether the OpenBLAS and MKL interfaces are the same for CBLAS and LAPACK(E) calls, i.e. whether I need to change the code or just the compilation/linking? – Toool Apr 19 '21 at 20:21
  • This is a bit complex (and not something I master). AFAIK, the MKL provides both the BLAS/CBLAS interfaces and the LAPACK/LAPACKE ones. OpenBLAS supports the CBLAS interface and partially LAPACKE, through the (standard, inefficient) Netlib package. I am not sure OpenBLAS supports the Fortran BLAS/LAPACK interfaces, but this should not be a problem for a C++ code. In your case, it should work with both by just relinking your program. You can find more information in the [MKL documentation](https://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_lapack_examples/dsyev.htm). – Jérôme Richard Apr 19 '21 at 21:34
  • Thank you for the link, I managed to write a configure script allowing to switch between the two compilations fairly easily (see the sketch below). For future readers: I just replaced `#include <cblas.h>` and `#include <lapacke.h>` by `#include <mkl.h>`, which defines the same syntax for CBLAS and LAPACKE. Also, the `extern "C"` has to be commented out or undefined when using icc. – Toool Apr 20 '21 at 07:53
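A minimal sketch of the header switch described in the last comment; the `USE_MKL` macro is hypothetical and would be defined by the configure script (e.g. via `-DUSE_MKL`):

#ifdef USE_MKL
    // The single MKL header declares both the CBLAS and LAPACKE interfaces.
    #include <mkl.h>
#else
    #include <cblas.h>
    extern "C"
    {
        #include <lapacke.h>
    }
#endif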