
I am benchmarking matrix multiplication across different libraries, because I am considering rewriting some Cython code in native C++. However, my simple tests seem to imply that NumPy is faster than BLAS or Eigen for plain matrix multiplication.

I have written the following files:

#!test_blas.cpp
#include <random>
#include <ctime>
#include <iostream>
#include <cblas.h>

int main ( int argc, char* argv[] ) {

    // Random numbers
    std::mt19937_64 rnd;
    std::uniform_real_distribution<double> doubleDist(0, 1);

    // Create arrays that represent the matrices A,B,C
    const int n = 2000;
    double*  A = new double[n*n];
    double*  B = new double[n*n];
    double*  C = new double[n*n];

    // Fill A and B with random numbers
    for(int i = 0; i < n; i++){
        for(int j = 0; j < n; j++){
            A[i*n+j] = doubleDist(rnd);
            B[i*n+j] = doubleDist(rnd);
        }
    }

    // Calculate A*B=C
    clock_t start = clock();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    clock_t end = clock();
    double time = double(end - start)/ CLOCKS_PER_SEC;
    std::cout << "Time taken : " << time << std::endl;
    // Clean up
    delete[] A;
    delete[] B;
    delete[] C;

    return 0;
}
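
As an aside on timing: `clock()` reports CPU time summed across all threads, so a multithreaded BLAS can look slower than it is in wall-clock terms. Below is a minimal sketch of the same measurement using the standard `std::chrono::steady_clock` (wall-clock time); it would replace the `clock()` block in test_blas.cpp and additionally needs `#include <chrono>`:

    auto t0 = std::chrono::steady_clock::now();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    auto t1 = std::chrono::steady_clock::now();
    // Wall-clock seconds, independent of how many threads the BLAS spawned
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::cout << "Time taken : " << seconds << std::endl;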

#!test_eigen.cpp
#include <iostream>
#include <ctime>
#include <Eigen/Dense>
using namespace Eigen;
int main()
{

    int n_a_rows = 2000;
    int n_a_cols = 2000;
    int n_b_rows = n_a_cols;
    int n_b_cols = 2000;

    MatrixXd a(n_a_rows, n_a_cols);

    for (int i = 0; i < n_a_rows; ++i)
        for (int j = 0; j < n_a_cols; ++j)
            a(i, j) = n_a_cols * i + j;

    MatrixXd b(n_b_rows, n_b_cols);
    for (int i = 0; i < n_b_rows; ++i)
        for (int j = 0; j < n_b_cols; ++j)
            b(i, j) = n_b_cols * i + j;

    MatrixXd d(n_a_rows, n_b_cols);

    clock_t begin = clock();

    d = a * b;
    clock_t end = clock();
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "Time taken : " << elapsed_secs << std::endl;

}
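
To rule out configuration issues on the Eigen side, it can help to print which SIMD instruction set and how many threads Eigen actually uses, and to time `d.noalias() = a * b` as well (as chtz suggests in the comments below), which lets Eigen evaluate the product directly into `d` instead of going through a temporary. A small sketch using Eigen's public introspection helpers, with the same 2000x2000 sizes as above:

#include <iostream>
#include <Eigen/Dense>

int main()
{
    // Report how Eigen was compiled/configured
    std::cout << "SIMD in use : " << Eigen::SimdInstructionSetsInUse() << std::endl;
    std::cout << "Threads     : " << Eigen::nbThreads() << std::endl;

    Eigen::MatrixXd a = Eigen::MatrixXd::Random(2000, 2000);
    Eigen::MatrixXd b = Eigen::MatrixXd::Random(2000, 2000);
    Eigen::MatrixXd d(2000, 2000);

    // noalias() promises that d does not overlap a or b, so the
    // product is written directly into d, skipping a temporary
    d.noalias() = a * b;
    return 0;
}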
#!test_numpy.py

import numpy as np
import time
N = 2000

a = np.random.rand(N, N)
b = np.random.rand(N, N)
start = time.time()
c = a.dot(b)
print(f"Time taken : {time.time() - start}")
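
Which backend NumPy dispatches to matters more than the Python layer here. A quick check (as max9111 notes in the comments below) using NumPy's own `show_config`:

import numpy as np

# Prints which BLAS/LAPACK backend (OpenBLAS, MKL, ...) NumPy was built against
np.show_config()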

Finally, I created the following test script:

#!test_matrix.sh
c++ -O2 -march=native -std=c++11 -I /usr/include/eigen3 test_eigen.cpp -o eigen
c++ -O2 -march=native -std=c++11 test_blas.cpp  -o blas -lcblas

echo "testing BLAS"
./blas
echo "testing Eigen"
./eigen
echo "testing numpy"
python test_numpy.py
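
One thing worth verifying is which library `-lcblas` actually resolves to: on some distributions it is the unoptimized reference CBLAS, while NumPy usually ships with OpenBLAS or MKL. Assuming an OpenBLAS package is installed (OpenBLAS exports the CBLAS interface, so the source is unchanged), a variant of the build line to compare against:

c++ -O2 -march=native -std=c++11 test_blas.cpp -o blas -lopenblas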

Running the original script yields the following output:

testing BLAS
Time taken : 1.63807
testing Eigen
Time taken : 0.795115
testing numpy
Time taken : 0.28397703170776367

Now my question is: how come NumPy is the fastest in these tests? Am I missing something with regard to optimizations?

One explanation could be that NumPy uses threading to compute the matrix product. However, adding the compiler flag `-fopenmp` yields worse performance for Eigen and BLAS. An experiment with pinned thread counts is sketched below.
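
To test the threading hypothesis without recompiling, the thread counts can be pinned through environment variables. A sketch, assuming an OpenBLAS backend for the BLAS test and for NumPy (`OPENBLAS_NUM_THREADS` is OpenBLAS-specific; Eigen parallelizes via OpenMP only when built with `-fopenmp`, in which case `OMP_NUM_THREADS` applies):

OPENBLAS_NUM_THREADS=1 ./blas
OMP_NUM_THREADS=1 ./eigen
OPENBLAS_NUM_THREADS=1 python test_numpy.py

If NumPy's edge disappears at one thread, the gap is down to multithreading rather than the kernels themselves.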

I am using g++ version 9.0.3-1. NumPy is version 1.18.1 on Python 3.8.2. Thanks in advance.

cvanelteren
  • NumPy is pretty fast since it uses compiled C libraries under the hood. However, there can be many reasons why Eigen and BLAS are so much slower here. Without having a closer look at your code, I guess you simply made a benchmarking mistake. Correct benchmarking is extremely tricky: you need to make sure that the conditions are really the same. However, this requires a lot of background knowledge about the stuff you are benchmarking and what optimization tricks your compiler/library/etc. uses. – wychmaster Apr 25 '20 at 11:02
  • I agree that I am probably missing some compiler optimization and/or other tricks. However, I would expect that even with minimal optimizations, as used here, the gap between the different packages would be smaller. This [post](https://stackoverflow.com/questions/51656818/benchmarking-matrix-multiplication-performance-c-eigen-is-much-slower-than) seems to find similar issues. I wonder how NumPy is able to perform so well. – cvanelteren Apr 25 '20 at 11:32
  • As I said, the heavy lifting in Numpy is done with compiled C libraries. Since matrix multiplication of this size is extremely time-consuming, the overhead of the Python side is comparatively small. So you are basically benchmarking the C library NumPy is using. You should also try different matrix sizes. If you get the same ratios for all sizes, it is probably related to your compiler/library settings or to multi-threading. Maybe you should check how many cores are active during the calculation. – wychmaster Apr 25 '20 at 11:55
  • I checked different sizes; smaller matrices yield faster results in C++ than in Python. As I mentioned in the post, if I add `-fopenmp` the results from both BLAS and Eigen tank dramatically, even though all cores are employed during the multiplication. I get your point; I'm just arguing that these deltas are larger than I would expect. If there are optimization flags that would put them on par, I can't seem to find them on the package websites. – cvanelteren Apr 25 '20 at 12:11
  • Well, the general problem is, that larger matrices require different algorithms due to the higher probability of cache misses. Multithreading is another thing that requires you to adjust your algorithm. I can't really tell how these problems are addressed in the different libraries. However, different algorithms can lead to large performance differences in certain size ranges, especially when cache misses are involved. – wychmaster Apr 25 '20 at 12:30
  • g++ sometimes has trouble optimizing inside the `main` function, so you should at least move the benchmarking stuff into a separate function. For Eigen, better use `d.noalias() = a * b;` to avoid an unnecessary malloc/copy. And for both C++ solutions you can try adding `-DNDEBUG` (should not make a huge difference, though). – chtz Apr 25 '20 at 14:36
  • In NumPy this is just a simple BLAS call. But NumPy can use different BLAS backends (compilation dependent). This may also be the reason why it is faster. A look at `numpy.show_config()` shows which BLAS backend NumPy uses (MKL, OpenBLAS, ...). – max9111 Apr 28 '20 at 11:39
  • I tested it with OpenBLAS, plain BLAS and MKL, with the latter being the fastest. A colleague told me that `chrono` should be preferred over `clock` for timing. – cvanelteren Apr 28 '20 at 13:41

0 Answers