
I ran some experiments on matrix inversion to understand the performance of matrix algorithms on different programming platforms. For a 4000x4000 positive definite real matrix, the NumPy code

import time
import numpy as np

# sts4098 is the 4000x4000 positive definite matrix, loaded earlier
tic = time.perf_counter()
sts4098_inverse = np.linalg.inv(sts4098)
toc = time.perf_counter()
print(f"Numpy inversion in {toc - tic:0.4f} seconds")

Numpy inversion in 0.9607 seconds

uses around 1 second. I then tried the CLAPACK interface contained in Apple's Accelerate framework, like this:

cout << "Using LAPACK to perform Cholesky Inversion... " << endl;
char UPLO[] = "U";  // CLAPACK takes a non-const char*
int N = n;
int LDA = n;
int info = 0;
int len_chol = n * (n + 1) / 2;
double* ap = (double*)malloc(len_chol * sizeof(double));
system_clock::time_point t1, t2;
t1 = system_clock::now();
// Cholesky factorization A = U^T U (full storage)
dpotrf_(UPLO, &N, a, &LDA, &info);
// Pack the upper triangle into packed storage for dpptri.
// Column-major order in LAPACK!
int ptr = 0;
for (int i = 0; i < n; i++) {
    for (int j = 0; j <= i; j++) {
        ap[ptr] = a[n * i + j];
        ptr++;
    }
}
// Inverse computed from the packed Cholesky factor
dpptri_(UPLO, &N, ap, &info);

t2 = system_clock::now();
duration<double, milli> ms = t2 - t1;
double time = ms.count() / 1000;
cout << fixed;
cout << "LAPACK Cholesky Inversion takes " << time << "s" << endl;
return ap;

LAPACK Cholesky Inversion takes 5.539074s

So the algorithm takes more than 5 s according to chrono::system_clock. This result seems impossible to me, since I compiled with the options "-Ofast" and "-mcpu=apple-m1". My laptop is an M1 Max model.

Is there anything that I missed? How is it possible that numpy is that much faster?
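A first diagnostic sketch (not from the original post): NumPy reports the BLAS/LAPACK backend it links against via np.show_config(), and the backend, not the Python layer, does the heavy lifting in np.linalg.inv. Since sts4098 is not available here, a random SPD matrix is used as a stand-in; sizes and timings are illustrative only.

```python
import time
import numpy as np

# Show which BLAS/LAPACK backend this NumPy build links against
# (Accelerate, OpenBLAS, MKL, ...).
np.show_config()

# Random symmetric positive definite stand-in for sts4098.
rng = np.random.default_rng(0)
n = 1000
m = rng.standard_normal((n, n))
a = m @ m.T + n * np.eye(n)

tic = time.perf_counter()
a_inv = np.linalg.inv(a)
toc = time.perf_counter()
print(f"Numpy inversion in {toc - tic:0.4f} seconds")

# Sanity-check the result.
assert np.allclose(a @ a_inv, np.eye(n))
```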

Slangevar
  • Did you compile with optimization flags e.g. `-O3`? – kiner_shah Feb 17 '23 at 09:46
  • @kiner_shah Hi! I compiled with -Ofast. With -O3 the C++ program was even a bit slower. – Slangevar Feb 17 '23 at 09:52
  • I found this function which computes inverse of a matrix using lapack: https://netlib.org/lapack/explore-html/dd/d9a/group__double_g_ecomputational_ga56d9c860ce4ce42ded7f914fdb0683ff.html – kiner_shah Feb 17 '23 at 09:58
  • What results do you get when you use `chrono::steady_clock` (or `chrono::high_resolution_clock`)? – Christian Halaszovich Feb 17 '23 at 10:03
  • @kiner_shah Hi! This matrix inverse function uses LU decomposition, which is supposed to be slower than Cholesky decomposition and its corresponding inverse method https://netlib.org/lapack/explore-html/da/dba/group__double_o_t_h_e_rcomputational_gae50ca6e928ba3eb2917521a6c886a41b.html (?) – Slangevar Feb 17 '23 at 10:03
  • @ChristianHalaszovich The result is almost identical, around 5s. – Slangevar Feb 17 '23 at 10:05
  • Doesn't numpy.linalg.inv use LU decomposition? – kiner_shah Feb 17 '23 at 10:09
  • Excepting the copy loop, Numpy should do similar calls. The main difference should be the library used. Numpy uses LAPACK internally, with OpenBLAS by default on most platforms. The performance difference certainly comes from the different library being used. Compiler optimizations can only speed up the copy loop. By the way, please consider putting the code in a loop to see if the first iteration is slower (for example due to page faults). – Jérôme Richard Feb 17 '23 at 10:29
  • @ChristianHalaszovich [`high_resolution_clock` should be avoided according to the author of `chrono`](https://stackoverflow.com/a/37440647/10107454). Use `steady_clock` for benchmarking. – paleonix Feb 17 '23 at 10:34
  • @JérômeRichard Hi! I looped over the same LAPACK code and the execution time is approximately the same each iteration, around 5s. I tried to figure out which version of BLAS my Numpy uses and ran into the same issue as this question: https://stackoverflow.com/questions/74913930/how-is-numpy-working-with-a-null-openblas-installation . In this sense, I am very curious which version of BLAS Numpy actually uses here. – Slangevar Feb 17 '23 at 11:34
  • If you are unsure, the best approach is to use a low-level sampling profiler to track C calls and find out which function of which library is actually called in both cases. This will certainly answer the question. – Jérôme Richard Feb 17 '23 at 17:06
  • Behind numpy's linalg.inv, there is dgesv, see https://github.com/numpy/numpy/blob/v1.24.0/numpy/linalg/umath_linalg.cpp on line 1794+. DGESV computes the inverse by means of the PLU decomposition (dgetrf+dgetrs). I would have expected the Cholesky (dpotrf+dpptri) to be faster, as it handles a symmetric matrix: https://www.mathworks.com/matlabcentral/fileexchange/34511-fast-and-accurate-symmetric-positive-definite-matrix-inverse-using-cholesky-decomposition?s_tid=FX_rc2_behav . The matrix sts4098 is sparse... Could you try the Cholesky inversion using Python or the PLU inversion using CLAPACK? – francis Feb 20 '23 at 08:54
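The Cholesky inversion suggested in the last comment can be tried from Python via SciPy's low-level LAPACK wrappers. The sketch below uses dpotrf/dpotri, the *full-storage* Cholesky routines, rather than the packed dpptri used in the C++ code, and a random SPD matrix as a stand-in for sts4098, which is not available here; it cross-checks the result against NumPy's LU-based inverse.

```python
import numpy as np
from scipy.linalg.lapack import dpotrf, dpotri

# Random symmetric positive definite stand-in for sts4098.
rng = np.random.default_rng(0)
n = 500
m = rng.standard_normal((n, n))
a = m @ m.T + n * np.eye(n)

# Cholesky factorization A = U^T U in full storage.
c, info = dpotrf(a, lower=0)
assert info == 0
# Inverse from the Cholesky factor; only the upper triangle is filled.
inv_u, info = dpotri(c, lower=0)
assert info == 0
# Symmetrize: mirror the upper triangle into the lower one.
a_inv = np.triu(inv_u) + np.triu(inv_u, k=1).T

# Cross-check against NumPy's LU-based inverse (dgesv path).
assert np.allclose(a_inv, np.linalg.inv(a))
```

Timing this against np.linalg.inv on the same matrix would show whether the slowdown in the C++ version comes from the packed-storage routine or from the library itself.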

0 Answers