
I am developing a small neural network whose parameters need a lot of optimization, and therefore a lot of processing time. I have profiled my script with cProfile, and 80% of the processor time goes to the NumPy dot function; the rest is matrix inversion via numpy.linalg.solve. My current version of NumPy appears to use BLAS, since numpy.core._dotblas.dot shows up as the function taking 80% of the total processing time.
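
For reference, here is a minimal sketch of how I profile it (the forward_pass function, the matrix sizes, and the iteration count below are simplified placeholders, not my actual script):

    import cProfile
    import pstats

    import numpy as np

    def forward_pass(w, x):
        # Stand-in for the network's core computation: a dot product.
        return np.dot(w, x)

    def run():
        w = np.random.rand(500, 500)
        x = np.random.rand(500, 500)
        for _ in range(1000):
            forward_pass(w, x)

    # Profile and list the functions sorted by cumulative time;
    # the BLAS-backed dot shows up at the top for me.
    cProfile.run("run()", "profile.out")
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)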

Since this computation is the core of my neural network and I have to run it many times, any minor speed gain would save me a lot of time across the numerous repeated parameter optimizations.

More details: the matrix multiplications involve matrices ranging from 100*100 up to 500*500. I have a computer with 12 cores and currently use them to run different neural network parameter optimizations in parallel, but maybe the matrix multiplication itself could be done in parallel?
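
For scale, this is roughly how I time a single multiplication at these sizes (a simplified sketch):

    import timeit

    import numpy as np

    for n in (100, 500):
        a = np.random.rand(n, n)
        b = np.random.rand(n, n)
        # Average over repeated calls so per-call overhead is amortized.
        t = timeit.timeit(lambda: np.dot(a, b), number=100) / 100
        print("%dx%d dot: %.3f ms" % (n, n, t * 1e3))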

Thank you for your time!

Answer:

I spent a few days testing and installing/uninstalling libraries... Here is the result of what I tested. By default, on my version of Ubuntu (12.04) with the repository-installed version of NumPy, the BLAS libraries are the ATLAS libraries. I made some tests that reflect the improvement specifically on the computations I am interested in, so these results must not be interpreted as the final answer. The computations involve a matrix multiplication (dot product) in a 55000-iteration loop, with 500*500 and 1000*1000 matrices. I use an HP Z800 workstation with a Xeon X5675 @ 3.07 GHz with 12 cores. All the results (percentages) compare the described condition against the reference, which here is the packaged ATLAS library.

  • Scipy.sparse module: I don't know if I set it up correctly, but with 10% sparsity, using this module becomes useful starting from 1500*1500 matrices with OpenBLAS and MKL. If you have suggestions about how to use it properly, I am interested!
  • With OpenBLAS I get a speed increase of 33% for 500*500 matrices, but 160% for 1000*1000. With OpenBLAS, however, the scipy.sparse module actually performs worse, not better.
  • The big winner here is the MKL library. The acceleration goes up to 230% with 1000*1000 matrices compared to the original ATLAS libraries! For the 500*500 matrices the acceleration is more modest (100%) but still very good. Furthermore, when compiled with OpenMP, matrix multiplications can run on my 12 processors, and they are then twice as fast as on one processor with the MKL libraries. But that is a waste of processing power; it is much more efficient to use the multiprocessing module to run scripts/matrix multiplications in parallel. (A sketch of the benchmark loop is below.)
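
For completeness, here is a sketch of the kind of benchmark loop behind these numbers (sizes and sparsity as described above; the exact loop body of my real script differs):

    import time

    import numpy as np
    import scipy.sparse as sp

    n = 1000        # also tested with n = 500
    density = 0.10  # 10% non-zero entries for the sparse case

    dense_a = np.random.rand(n, n)
    dense_b = np.random.rand(n, n)
    sparse_a = sp.rand(n, n, density=density, format="csr")
    sparse_b = sp.rand(n, n, density=density, format="csr")

    # Dense product, going through whatever BLAS numpy is linked against.
    start = time.time()
    for _ in range(100):
        np.dot(dense_a, dense_b)
    print("dense dot:  %.3f s" % (time.time() - start))

    # Sparse product through scipy.sparse (does not use BLAS).
    start = time.time()
    for _ in range(100):
        sparse_a.dot(sparse_b)
    print("sparse dot: %.3f s" % (time.time() - start))
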
PierreE

2 Answers


If you're not doing so already, you could try linking numpy to a highly optimized BLAS library like Intel MKL (which is free-as-in-beer for non-commercial use or discounted for academic use, which apparently doesn't count as non-commercial; instructions from Intel for using it with numpy) or OpenBLAS (free-as-in-speech). There's also the Enthought Python Distribution, which comes pre-linked to MKL and is free-as-in-beer for academics. Either can parallelize your matrix multiplications automatically and can be much faster than the typical reference BLAS / ATLAS installation on most Linux distros, or whatever it is you're using.
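
To see which BLAS your numpy build is actually linked against before you switch, you can check its build configuration:

    import numpy as np

    # Prints the BLAS/LAPACK libraries numpy was compiled against;
    # look for 'atlas', 'openblas', or 'mkl' in the output.
    np.show_config()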

Otherwise, the only thing I know of that you could do would be to use some mathematical tricks so you don't have to compute as many multiplications / solves. Without knowing exactly what you're doing, it's hard to give any suggestions there.

I'm assuming that your matrices are dense, since they usually are in neural nets, but if you're doing something unusual scipy.sparse might help too.
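
If they do turn out to be sparse, the idea would be something like this (just a sketch; the 10% density is taken from your setup):

    import numpy as np
    import scipy.sparse as sp

    # Dense weight matrix with ~90% zeros, converted to CSR once
    # so that repeated products skip the zero entries.
    w = np.random.rand(500, 500)
    w[w > 0.1] = 0.0
    w_csr = sp.csr_matrix(w)

    x = np.random.rand(500, 500)
    y = w_csr.dot(x)  # dense result
    print(y.shape)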

Danica
  • OpenBLAS is likely a good free option that could speed things up considerably. It should be pretty easily available on most Linux systems, for example. – seberg Sep 02 '12 at 20:01
  • I didn't realize that MKL isn't free (as in beer); [OpenBLAS](http://xianyi.github.com/OpenBLAS/) is probably a good alternative. EPD is free to academics, though. – Danica Sep 02 '12 at 20:11
  • My neural net is actually sparse (10% connectivity), and I get a 20% speed-up, which isn't much but is better than nothing. You mention OpenBLAS; will it run faster than my current version of numpy, which already uses a version of BLAS? – PierreE Sep 03 '12 at 11:25
  • @pierotiste I wouldn't scoff at 20%! The speedup from OpenBLAS or MKL depends entirely on which library you're currently using; some Linux distros ship a very straightforward, untuned Fortran implementation, in which case you might get a big speedup, or you might not. (I don't know if OpenBLAS does the sparse stuff, actually, but I think MKL might?) – Danica Sep 03 '12 at 15:34
  • So I installed the Ubuntu OpenBLAS packages and linked them so that numpy now uses them (is that how to install it? I didn't have to compile anything...), and it is now more than twice as fast without the scipy.sparse dot function. It is slower with it... but MKL is not free. – PierreE Sep 03 '12 at 15:44
  • @pierotiste It is for non-commercial use, as linked to in my answer. You should also make sure that you re-linked scipy as well as numpy. – Danica Sep 03 '12 at 15:48
  • @Dougal Actually the speed-up goes up to 500% for a 2000*2000 matrix with 10% sparsity with OpenBLAS and scipy.sparse! It's a victory! Thank you very much! – PierreE Sep 03 '12 at 16:10

Numpy uses really fast internal algorithms and representations based on third-party libraries (such as BLAS, as you mentioned) that already use SSE optimizations, among others. Because the original BLAS is a tad slow (it aims to be a reference implementation, focusing on precision rather than performance), you may wish to use another flavor focused on performance, such as OpenBLAS. To use OpenBLAS, you need to either find a pre-built OpenBLAS-enabled Numpy package or recompile a version linked against OpenBLAS. Once you are using an efficient BLAS implementation, you won't find a better speedup option in pure Python, unless you write a library in C and spend much time optimizing it.

On the other hand, you can check whether your Numpy and BLAS library are compiled as efficiently as possible for your architecture. For instance, if you can activate the OpenMP library when compiling Numpy, it would allow multiple cores to work on your problem using data-level parallelism. This can be a significant source of speedup if you have multiple cores on your computer and your computations are CPU-bound. If your kind of problem allows it, you could even use a task-based parallel programming library (SCOOP [Disclaimer: I wrote it], Celery, etc.) to spread your work across multiple computers.
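
As a minimal sketch of the task-based approach on a single machine, the standard-library multiprocessing module already gives you this (the optimize function below is a placeholder for one parameter-optimization run, not your actual code):

    import os

    # If your BLAS is itself threaded, pin it to one thread per process
    # to avoid oversubscription; must be set before importing numpy.
    os.environ["OMP_NUM_THREADS"] = "1"

    import multiprocessing as mp

    import numpy as np

    def optimize(seed):
        # Placeholder for one full parameter-optimization run; each
        # worker process calls the BLAS-backed dot independently.
        rng = np.random.RandomState(seed)
        w = rng.rand(500, 500)
        for _ in range(100):
            w = np.dot(w, w)
            w /= np.abs(w).max()  # keep values bounded
        return seed, float(w.sum())

    if __name__ == "__main__":
        pool = mp.Pool(processes=12)  # one worker per core
        results = pool.map(optimize, range(12))
        pool.close()
        pool.join()
        print(results)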

As a last resort, another possibility would be to buy new hardware; it can make software go faster without changing a single line of code.

Soravux
  • Thanks, I will check which version of BLAS is installed and try to compile with OpenMP activated. How complicated is it? – PierreE Sep 03 '12 at 11:27
  • @pierotiste: It should not be that hard on a *nix-based system. It should consist of recompiling Numpy while linking against the new libraries and/or flags. You should check online for blogs or the Numpy manual for more details. Depending on the libraries you choose, the required steps may change. – Soravux Sep 03 '12 at 23:20