
I am trying to estimate how good Python's performance is compared to C++.

Here is my Python code:

import numpy as np

a = np.random.rand(1000,1000)   # dtype is automatically float64
b = np.random.rand(1000,1000)
c = np.empty((1000,1000), dtype='float64')   # preallocated output for the product

%timeit a.dot(b,out=c)

#15.5 ms ± 560 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And here is my C++ code, which I compile with Xcode in Release mode:

#include <iostream>
#include <Dense>   // Eigen/Dense (the -I path below points inside eigen3/eigen)
#include <time.h>

using namespace Eigen;
using namespace std;

int main(int argc, const char * argv[]) {
    // Seed the RNG (MatrixXd::Random uses std::rand)
    unsigned int seed = clock();
    srand(seed);

    int Msize = 1000, Nloops = 10;

    MatrixXd m1 = MatrixXd::Random(Msize, Msize);
    MatrixXd m2 = MatrixXd::Random(Msize, Msize);
    MatrixXd m3 = MatrixXd::Random(Msize, Msize);

    cout << "Starting matrix multiplication test with " << Msize
         << " matrices" << endl;

    clock_t start = clock();
    for (int i = 0; i < Nloops; i++)
        m3 = m1 * m2;
    start = clock() - start;

    cout << "time elapsed for 1 multiplication: "
         << start / ((double) CLOCKS_PER_SEC * (double) Nloops)
         << " seconds" << endl;
    return 0;
}

And the result is:

Starting matrix multiplication test with 1000 matrices
time elapsed for 1 multiplication: 0.148856 seconds
Program ended with exit code: 0

This means the C++ program is 10 times slower.

Alternatively, I've tried compiling the C++ code from the macOS terminal:

g++ -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -o my_exec -O3

./my_exec

Starting matrix multiplication test with 1000 matrices
time elapsed for 1 multiplication: 0.150432 seconds

I am aware of a very similar question; however, it looks like the issue there was in the matrix definitions. In my example I've used Eigen's default functions to create matrices from a uniform distribution.

Thanks, Mikhail

Edit: I found out that while numpy uses multithreading, Eigen does not use multiple threads by default (checked with the Eigen::nbThreads() function). As suggested, I used the -march=native option, which reduced computation time by a factor of 3. Taking into account the 8 threads available on my Mac, I can believe that with multithreading numpy runs 3 times faster. A minimal sketch of the thread check is below.
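
For reference, this is roughly how the thread count can be checked; Eigen::nbThreads() and Eigen::setNbThreads() are Eigen's documented threading API, and the include style matches the compile command above:

#include <iostream>
#include <Dense>  // Eigen/Dense, via the eigen3/eigen include path used above

int main() {
    // Prints 1 unless Eigen was compiled with OpenMP enabled (-fopenmp)
    std::cout << "Eigen threads: " << Eigen::nbThreads() << std::endl;
    // Eigen::setNbThreads(4);  // optionally set the thread count explicitly
    return 0;
}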

  • It is highly unlikely Python will have better performance than C++. – Ron Aug 02 '18 at 15:06
  • Mikhail, did you check whether `numpy` possibly uses multithreading or GPU offloading? I doubt `Eigen` does it by default, but in Python it wouldn't surprise me. – Max Langhof Aug 02 '18 at 15:14
  • @RushabhMehta the Eigen documentation says that the performance is very close to optimal, e.g. BLAS – Mikhail Genkin Aug 02 '18 at 15:16
  • @MaxLanghof thanks for the useful comment. As far as I understand, Eigen should also do multithreading. Even if it doesn't, the performance shouldn't be 10x worse, since I only have 4 cores and 8 threads. Going to check on GPU – Mikhail Genkin Aug 02 '18 at 15:17
  • For more information about the underlying library used by numpy, show the output of `np.show_config()`. – Warren Weckesser Aug 02 '18 at 15:19
  • I don't think it would matter, but in the C++ code you randomize `m3`. What happens if you don't, and leave it empty like you do in Python? – NathanOliver Aug 02 '18 at 15:22
  • @WarrenWeckesser thanks for the suggestion. I can see that I definitely use MKL (installed it a while ago) for numpy. Didn't see any trace of GPU. In addition, on my Mac I only have a Radeon GPU, while an NVIDIA one is required – Mikhail Genkin Aug 02 '18 at 15:22
  • @NathanOliver I've just checked - the test time is essentially the same (as expected) – Mikhail Genkin Aug 02 '18 at 15:23
  • @MikhailGenkin OK. I figured as much but wanted to make sure. – NathanOliver Aug 02 '18 at 15:24
  • I don't think Python `dot` does what you think it does and that is why it is faster. – Marek R Aug 02 '18 at 15:25
  • @MarekR numpy.dot multiplies matrix by matrix. I am 100% sure, as I have some Python experience – Mikhail Genkin Aug 02 '18 at 15:28
  • See my answer; I ran a simple example and it doesn't multiply matrices, but the multiplication operator does. – Marek R Aug 02 '18 at 15:32
  • @MarekR `.dot` does matrix-by-matrix multiplication. `x*y` does term-by-term multiplication, which I didn't consider in my test. Anyway, term-by-term multiplication is way faster. In Python `c=a*b` is executed in 2.5 ms (compared to the original 15 ms in my question) – Mikhail Genkin Aug 02 '18 at 15:35
  • @MikhailGenkin, do the results change much if you increase `Nloops` in the C++ code (to, say, 100)? – Warren Weckesser Aug 02 '18 at 15:50
  • `-march=native` – Marc Glisse Aug 02 '18 at 15:54
  • clock is not the right function to measure performance. – n. m. could be an AI Aug 02 '18 at 15:59
  • @WarrenWeckesser no, not really – Mikhail Genkin Aug 02 '18 at 15:59
  • @MarcGlisse, wow, adding this option reduced computation time by a factor of 3 (now it is only 45-50 ms per loop). Taking into account that multithreading was not used (there is a function in Eigen that allows me to check it), I believe that unveils the mystery – Mikhail Genkin Aug 02 '18 at 16:00
  • @n.m., why not? I can use the high-precision timer from C++11, but there is no need, as running times are all on the order of seconds – Mikhail Genkin Aug 02 '18 at 16:01
  • Compiling with `-fopenmp` would enable multithreading, except that you are on Mac so you probably have a fake g++ and possibly no OpenMP. – Marc Glisse Aug 02 '18 at 16:03
  • @MarcGlisse that is completely accurate. I am trying to set it up – Mikhail Genkin Aug 02 '18 at 16:05
  • `clock` measures CPU time. Any multithreading or sleep will throw it off. – n. m. could be an AI Aug 02 '18 at 16:06
  • On Mac you can easily install any version of g++/clang with OpenMP support through MacPorts, or you can also ask Eigen to fall back to Apple's Accelerate: `-framework Accelerate -DEIGEN_USE_BLAS`. – ggael Aug 02 '18 at 16:19
  • I'm observing Eigen being more than ten times *faster* than numpy, with `-Ofast -march=native` flags. (55 vs 700 ms per iteration) – n. m. could be an AI Aug 02 '18 at 16:38
  • For Eigen, write `m3.noalias()=m1*m2;` to avoid a matrix copy. – chtz Aug 02 '18 at 16:39
  • @n.m. This entirely depends on your BLAS implementation. Essentially, this benchmark compares Eigen's matrix multiplication with the BLAS used by numpy. – chtz Aug 02 '18 at 16:41
  • @chtz I don't have any special BLAS. All I have is `apt-get install python-numpy`, like 99% of us. – n. m. could be an AI Aug 02 '18 at 16:49
  • Thanks for the suggestions: with `-framework Accelerate -DEIGEN_USE_BLAS` the time increases by a factor of two, so it does not help. With `-Ofast -march=native` the time is the same as before (`-O3 -march=native`). With `m3.noalias()=m1*m2` the time decreases a little bit (by about 4%) – Mikhail Genkin Aug 02 '18 at 16:51
  • @MikhailGenkin Did you see the comment by n.m.? The way you measure time doesn't do what you want for multithreading. – Marc Glisse Aug 02 '18 at 18:01
  • @MarcGlisse yes, I have. The bigger problem is that I have never used multithreading before. I am currently struggling with the OpenMP installation – Mikhail Genkin Aug 02 '18 at 20:08
  • Maybe Numpy is using the GPGPU (e.g. through OpenCL or CUDA) – Basile Starynkevitch Aug 03 '18 at 15:13
  • @BasileStarynkevitch CUDA is not possible with a non-NVIDIA GPU. I never configured/installed OpenCL. np.show_config() never showed any trace of GPU – Mikhail Genkin Aug 03 '18 at 15:23

1 Answer


After long and painful installations and compilations, I've performed benchmarks in Matlab, C++, and Python.

My computer: macOS High Sierra 10.13.6 with an Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz (4 cores, 8 threads). I have a Radeon Pro 560 (4096 MB), so no GPU was involved in these tests (I never configured OpenCL and didn't see it in np.show_config()).

Software: Matlab 2018a, Python 3.6, C++ compilers: Apple LLVM version 9.1.0 (clang-902.0.39.2), g++-8 (Homebrew GCC 8.2.0) 8.2.0

1) Matlab performance: time = (14.3 ± 0.7) ms, with 10 runs performed

a=rand(1000,1000);
b=rand(1000,1000);
c=rand(1000,1000);
tic
for i=1:100
    c=a*b;
end
toc/100

2) Python performance (%timeit a.dot(b,out=c)): 15.5 ± 0.8 ms

I've also installed the MKL libraries for Python. With numpy linked against MKL: 14.4 ± 0.7 ms. It helps, but very little.

3) C++ performance. The following changes were applied to the original code (see the question):

  • The noalias() function, to avoid creating unnecessary temporary matrices.

  • Time was measured with the C++11 chrono library (a minimal sketch follows this list).
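
The timing harness was roughly the following (a minimal sketch, not my verbatim benchmark code). std::chrono::steady_clock measures wall time, which stays meaningful once multithreading is enabled, unlike clock(), which sums CPU time across all threads:

#include <iostream>
#include <chrono>
#include <Dense>  // Eigen/Dense, via the eigen3/eigen include path

int main() {
    const int Msize = 1000, Nloops = 10;
    Eigen::MatrixXd m1 = Eigen::MatrixXd::Random(Msize, Msize);
    Eigen::MatrixXd m2 = Eigen::MatrixXd::Random(Msize, Msize);
    Eigen::MatrixXd m3(Msize, Msize);

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < Nloops; i++)
        m3.noalias() = m1 * m2;   // noalias() avoids the temporary copy
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> elapsed = stop - start;
    std::cout << "time per multiplication: "
              << elapsed.count() / Nloops << " ms" << std::endl;
    return 0;
}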

Here I used a bunch of different options and two different compilers:

3.1 clang++ -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3

Execution time ~ 146 ms

3.2 Added the -march=native option:

Execution time ~ 46 ± 2 ms

3.3 Changed compiler to GNU g++ (on my Mac it is invoked as gpp via a custom alias):

gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3

Execution time 222 ms

3.4 Added the -march=native option:

Execution time ~ 45.5 ± 1 ms

At this point I realized that Eigen does not use multiple threads. I installed OpenMP and added the -fopenmp flag. Note that OpenMP does not work with the latest Apple clang, so I had to use g++ from then on. I also made sure I was actually using all available threads, both by monitoring the value of Eigen::nbThreads() and with the macOS Activity Monitor.

3.5  gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3 -march=native -fopenmp

Execution time: 16.5 ± 0.7 ms

3.6 Finally, I installed the Intel MKL libraries. In the code they are quite easy to use: I just added the #define EIGEN_USE_MKL_ALL macro and that's it. It was hard to link all the libraries, though:

gpp -std=c++11 -DMKL_LP64 -m64 -I${MKLROOT}/include -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen -L${MKLROOT}/lib -Wl,-rpath,${MKLROOT}/lib -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl   main.cpp -o my_exec_intel -O3 -fopenmp  -march=native

Execution time: 14.33 ± 0.26 ms. (Editor's note: this answer originally claimed to have used -DMKL_ILP64, which is not supported. Maybe it used to be, or happened to work.)
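
For completeness, the only source change needed for step 3.6 is the macro, which must be defined before any Eigen header is included:

#define EIGEN_USE_MKL_ALL   // have Eigen use Intel MKL as its backend where supported
#include <Dense>            // Eigen/Dense; must come after the define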

Conclusion:

  • Matrix-matrix multiplication in Python/Matlab is highly optimized. It is not possible (or, at least, very hard) to do significantly better (on a CPU).

  • C++ code (at least on the Mac platform) can only achieve similar performance if fully optimized, which includes the full set of optimization options and the Intel MKL libraries. I could have installed an old clang compiler with OpenMP support, but since the single-threaded performance is similar (~46 ms), it looks like this would not help.

  • It would be great to try this with the native Intel compiler, icc. Unfortunately, it is proprietary software, unlike the Intel MKL libraries.

Thanks for the useful discussion,

Mikhail

Edit: For comparison, I've also benchmarked my GTX 980 GPU using the cublasDgemm function. Computation time = 12.6 ms, which is comparable with the other results. The reason CUDA is only about as fast as the CPU is the following: my GPU is poorly optimized for doubles. With floats, the GPU time = 0.43 ms, while Matlab's is 7.2 ms. A minimal sketch of the GPU timing loop is below.
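
The GPU benchmark was along the following lines; a minimal sketch using the cuBLAS and CUDA runtime APIs, with error checking omitted. The matrix contents don't matter for timing, and cuBLAS's column-major layout is irrelevant for a square timing test:

#include <iostream>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int N = 1000, Nloops = 10;
    const size_t bytes = (size_t)N * N * sizeof(double);

    std::vector<double> h(N * N, 1.0);   // dummy host data

    double *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, h.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, h.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;

    // Warm-up call so the timed loop excludes one-time initialization
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, dA, N, dB, N, &beta, dC, N);
    cudaDeviceSynchronize();

    // Time with CUDA events; kernel launches are asynchronous,
    // so synchronizing on the stop event is essential
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    for (int i = 0; i < Nloops; i++)
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, dA, N, dB, N, &beta, dC, N);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    std::cout << "time per multiplication: " << ms / Nloops << " ms" << std::endl;

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}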

Edit 2: To gain significant GPU acceleration, I would need to benchmark matrices with much bigger sizes, e.g. 10k x 10k

Edit 3: Changed the interface from MKL_ILP64 to MKL_LP64 since ILP64 is not supported.
