
I am benchmarking matrix multiplication across different libraries, because I am considering rewriting some Cython code in native C++. However, my simple tests seem to imply that NumPy is faster than BLAS or Eigen for plain matrix multiplication.

I have written the following files:

#!test_blas.cpp
#include <random>
#include <ctime>
#include <iostream>
#include <cblas.h>

int main ( int argc, char* argv[] ) {

    // Random numbers
    std::mt19937_64 rnd;
    std::uniform_real_distribution<double> doubleDist(0, 1);

    // Create arrays that represent the matrices A,B,C
    const int n = 2000;
    double*  A = new double[n*n];
    double*  B = new double[n*n];
    double*  C = new double[n*n];

    // Fill A and B with random numbers
    for(int i = 0; i < n; i++){
        for(int j = 0; j < n; j++){
            A[i*n+j] = doubleDist(rnd);
            B[i*n+j] = doubleDist(rnd);
        }
    }

    // Calculate A*B=C
    clock_t start = clock();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    clock_t end = clock();
    double time = double(end - start)/ CLOCKS_PER_SEC;
    std::cout << "Time taken : " << time << std::endl;
    // Clean up
    delete[] A;
    delete[] B;
    delete[] C;

    return 0;
}
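
As an aside on timing: `clock()` reports CPU time summed across all threads, so a multithreaded BLAS can look slower than it is in wall-clock terms. Below is a minimal sketch of the same measurement using the standard `std::chrono::steady_clock` (wall-clock time); it would replace the `clock()` block in test_blas.cpp and additionally needs `#include <chrono>`:

    auto t0 = std::chrono::steady_clock::now();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    auto t1 = std::chrono::steady_clock::now();
    // Wall-clock seconds, independent of how many threads the BLAS spawned
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::cout << "Time taken : " << seconds << std::endl;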

#!test_eigen.cpp
#include <iostream>
#include <ctime>
#include <Eigen/Dense>
using namespace Eigen;
int main()
{

    int n_a_rows = 2000;
    int n_a_cols = 2000;
    int n_b_rows = n_a_cols;
    int n_b_cols = 2000;

    MatrixXd a(n_a_rows, n_a_cols);

    for (int i = 0; i < n_a_rows; ++i)
        for (int j = 0; j < n_a_cols; ++j)
            a(i, j) = n_a_cols * i + j;

    MatrixXd b(n_b_rows, n_b_cols);
    for (int i = 0; i < n_b_rows; ++i)
        for (int j = 0; j < n_b_cols; ++j)
            b(i, j) = n_b_cols * i + j;

    MatrixXd d(n_a_rows, n_b_cols);

    clock_t begin = clock();

    d = a * b;
    clock_t end = clock();
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "Time taken : " << elapsed_secs << std::endl;

}
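
To rule out configuration issues on the Eigen side, it can help to print which SIMD instruction set and how many threads Eigen actually uses, and to time `d.noalias() = a * b` as well (as chtz suggests in the comments below), which lets Eigen evaluate the product directly into `d` instead of going through a temporary. A small sketch using Eigen's public introspection helpers, with the same 2000x2000 sizes as above:

#include <iostream>
#include <Eigen/Dense>

int main()
{
    // Report how Eigen was compiled/configured
    std::cout << "SIMD in use : " << Eigen::SimdInstructionSetsInUse() << std::endl;
    std::cout << "Threads     : " << Eigen::nbThreads() << std::endl;

    Eigen::MatrixXd a = Eigen::MatrixXd::Random(2000, 2000);
    Eigen::MatrixXd b = Eigen::MatrixXd::Random(2000, 2000);
    Eigen::MatrixXd d(2000, 2000);

    // noalias() promises that d does not overlap a or b, so the
    // product is written directly into d, skipping a temporary
    d.noalias() = a * b;
    return 0;
}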
#!test_numpy.py

import numpy as np
import time
N = 2000

a = np.random.rand(N, N)
b = np.random.rand(N, N)
start = time.time()
c = a.dot(b)
print(f"Time taken : {time.time() - start}")
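
Which backend NumPy dispatches to matters more than the Python layer here. A quick check (as max9111 notes in the comments below) using NumPy's own `show_config`:

import numpy as np

# Prints which BLAS/LAPACK backend (OpenBLAS, MKL, ...) NumPy was built against
np.show_config()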

Finally, I created the following test script:

#!test_matrix.sh
c++ -O2 -march=native -std=c++11 -I /usr/include/eigen3 test_eigen.cpp -o eigen
c++ -O2 -march=native -std=c++11 test_blas.cpp  -o blas -lcblas

echo "testing BLAS"
./blas
echo "testing Eigen"
./eigen
echo "testing numpy"
python test_numpy.py
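
One thing worth verifying is which library `-lcblas` actually resolves to: on some distributions it is the unoptimized reference CBLAS, while NumPy usually ships with OpenBLAS or MKL. Assuming an OpenBLAS package is installed (OpenBLAS exports the CBLAS interface, so the source is unchanged), a variant of the build line to compare against:

c++ -O2 -march=native -std=c++11 test_blas.cpp -o blas -lopenblas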

Running the original script yields the following output:

testing BLAS
Time taken : 1.63807
testing Eigen
Time taken : 0.795115
testing numpy
Time taken : 0.28397703170776367

Now my question is: how come NumPy is the fastest in these tests? Am I missing something with regard to optimizations?

One explanation could be that NumPy uses threading to compute the matrix product. However, adding the compiler flag `-fopenmp` yields worse performance for Eigen and BLAS. An experiment with pinned thread counts is sketched below.
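
To test the threading hypothesis without recompiling, the thread counts can be pinned through environment variables. A sketch, assuming an OpenBLAS backend for the BLAS test and for NumPy (`OPENBLAS_NUM_THREADS` is OpenBLAS-specific; Eigen parallelizes via OpenMP only when built with `-fopenmp`, in which case `OMP_NUM_THREADS` applies):

OPENBLAS_NUM_THREADS=1 ./blas
OMP_NUM_THREADS=1 ./eigen
OPENBLAS_NUM_THREADS=1 python test_numpy.py

If NumPy's edge disappears at one thread, the gap is down to multithreading rather than the kernels themselves.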

I am using g++ version 9.0.3-1. NumPy is version 1.18.1 on Python 3.8.2. Thanks in advance.

cvanelteren
  • NumPy is pretty fast since it uses compiled C libraries under the hood. However, there can be many reasons why Eigen and BLAS are so much slower here. Without having a closer look at your code, I guess you simply made a benchmarking mistake. Correct benchmarking is extremely tricky: you need to make sure that the conditions are really the same. However, this requires a lot of background knowledge about the stuff you are benchmarking and what optimization tricks your compiler/library/etc. uses. – wychmaster Apr 25 '20 at 11:02
  • I agree that I am probably missing some compiler optimization and/or other tricks. However, I would expect that even with minimal optimizations, as used here, the gap between the different packages would be smaller. This [post](https://stackoverflow.com/questions/51656818/benchmarking-matrix-multiplication-performance-c-eigen-is-much-slower-than) seems to find similar issues. I wonder how NumPy is able to perform so well. – cvanelteren Apr 25 '20 at 11:32
  • As I said, the heavy lifting in Numpy is done with compiled C libraries. Since matrix multiplication of this size is extremely time-consuming, the overhead of the Python side is comparatively small. So you are basically benchmarking the C library NumPy is using. You should also try different matrix sizes. If you get the same ratios for all sizes, it is probably related to your compiler/library settings or to multi-threading. Maybe you should check how many cores are active during the calculation. – wychmaster Apr 25 '20 at 11:55
  • I checked different sizes; smaller matrices yield faster results in C++ than in Python. As I mentioned in the post, if I add `-fopenmp` the results from both BLAS and Eigen tank dramatically, even though all cores are employed during the multiplication. I get your point; I'm just arguing that these deltas are larger than I would expect. If there are optimization flags that would put them on par, I can't seem to find them on the package websites. – cvanelteren Apr 25 '20 at 12:11
  • Well, the general problem is, that larger matrices require different algorithms due to the higher probability of cache misses. Multithreading is another thing that requires you to adjust your algorithm. I can't really tell how these problems are addressed in the different libraries. However, different algorithms can lead to large performance differences in certain size ranges, especially when cache misses are involved. – wychmaster Apr 25 '20 at 12:30
  • g++ sometimes has trouble optimizing inside the `main` function, so you should at least move the benchmarking stuff into a separate function. For Eigen, better use `d.noalias() = a * b;` to avoid an unnecessary malloc/copy. And for both C++ solutions you can try adding `-DNDEBUG` (should not make a huge difference, though). – chtz Apr 25 '20 at 14:36
  • In NumPy this is just a simple BLAS call. But NumPy can use different BLAS backends (compilation dependent). This may also be the reason why it is faster. A look at `numpy.show_config()` shows which BLAS backend NumPy uses (MKL, OpenBLAS, ...). – max9111 Apr 28 '20 at 11:39
  • I tested it with OpenBLAS, plain BLAS and MKL, with the latter being the fastest. A colleague told me that `chrono` should be preferred over `clock` for timing. – cvanelteren Apr 28 '20 at 13:41

0 Answers