14

What is the best matrix multiplication algorithm? What does 'the best' mean for me? It means the fastest, ready for today's machines.

Please give links to pseudocode if you can.

guest
  • 1,696
  • 4
  • 20
  • 31

8 Answers

14

BLAS is the best ready-to-use efficient matrix multiplication library. There are many different implementations. Here is a benchmark I made of some implementations on a MacBook Pro with a dual-core Intel Core 2 Duo at 2.66 GHz:

[benchmark plot: matrix-multiply performance of several BLAS implementations]

There are also other commercial implementations that I didn't test here.

Antonin Portelli
  • 688
  • 5
  • 12
9

The best matrix multiplication algorithm is the one that someone with detailed architectural knowledge has already hand-tuned for your target platform.

There are lots of good libraries that supply tuned matrix-multiply implementations. Use one of them.

Stephen Canon
  • 103,815
  • 19
  • 183
  • 269
8

There are probably better ones, but these are the ones I've heard of (better than the standard cubic-complexity algorithm).

Strassen's - O(N^2.8)

Coppersmith–Winograd - O(N^2.376)
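To make Strassen's concrete, here is an illustrative sketch in Python/numpy (my own, not from a library). It assumes square matrices whose size is a power of two, and the `leaf` cutoff below which it falls back to the naive product is an arbitrary tuning parameter:

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Strassen's algorithm: 7 recursive multiplies instead of 8,
    giving O(N^2.8). Assumes square, power-of-two-sized inputs."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B  # fall back to the naive product on small blocks
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The seven Strassen products
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

In practice the extra additions mean Strassen's only wins for fairly large matrices, which is why the leaf cutoff matters.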

Sonny Saluja
  • 7,193
  • 2
  • 25
  • 39
  • 3
    These have a better asymptotic complexity than the "standard" O(N^3) algorithm, but the constant (at least for Coppersmith–Winograd) is prohibitively large for moderate-sized matrices. See this post on mathoverflow: http://mathoverflow.net/questions/1743/what-is-the-constant-of-the-coppersmith-winograd-matrix-multiplication-algorithm – celion Dec 16 '10 at 00:10
  • 1
    Excellent point. That makes me wonder if the matrix libraries mentioned by Jim adapt the algorithm based on the input size. – Sonny Saluja Dec 16 '10 at 00:29
  • As I understand it, you can break the multiplication into blocks, such that all the data you're working on fits in cache, so that gives you a good constant speedup. ATLAS actually benchmarks itself to tune its parameters. – celion Dec 16 '10 at 09:07
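The blocking idea from the comments can be sketched as follows (a minimal illustration, not how ATLAS actually implements it; the block size `bs` stands in for the parameter ATLAS would tune by benchmarking):

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """Naive O(N^3) product reorganized into bs x bs tiles, so the
    tiles of A, B and C being worked on stay resident in cache."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=np.result_type(A, B))
    for i0 in range(0, n, bs):
        for k0 in range(0, k, bs):
            for j0 in range(0, m, bs):
                # numpy slicing clips at the edges, so ragged
                # trailing blocks are handled automatically
                C[i0:i0+bs, j0:j0+bs] += (
                    A[i0:i0+bs, k0:k0+bs] @ B[k0:k0+bs, j0:j0+bs]
                )
    return C
```

Same arithmetic and same asymptotic cost as the textbook triple loop; only the traversal order changes, which is exactly where the constant-factor speedup comes from.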
6

Why pseudocode? Why implement it yourself? If speed is your concern, there are highly optimized implementations available that include optimizations for specific instruction sets (e.g. SIMD); implementing all of that yourself offers no real benefit (apart from maybe learning).

Take a look at different BLAS implementations, like:

http://www.netlib.org/blas/

http://math-atlas.sourceforge.net/
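As a practical note: if you're in Python, numpy already delegates dense products to whatever BLAS it was built against (ATLAS, OpenBLAS, MKL, Accelerate, ...), so "use BLAS" often just means "use your library's matmul and don't hand-roll it":

```python
import numpy as np

# numpy's @ operator dispatches dense float matrix products to the
# linked BLAS's gemm routine, so this line is already a tuned,
# SIMD-friendly multiply on most installs.
A = np.random.rand(512, 512)
B = np.random.rand(512, 512)
C = A @ B
```

(You can inspect which BLAS your numpy build links against via `np.__config__.show()`.)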

Jim Brissom
  • 31,821
  • 4
  • 39
  • 33
3

Depends on the size of the matrix, and whether it's sparse or not.

For small-to-medium-sized dense matrices, I believe that some variation on the "naive" O(N^3) algorithm is a win, if you pay attention to cache-coherence and use the platform's vector instructions.

Data arrangement is important -- for cases where your standard matrix layout is cache-unfriendly (e.g., column-major * row-major), you should try binary decomposition of your matrix multiplication -- even if you don't use Strassen's or other "fast" algorithms, this order of operations can yield a "cache-oblivious" algorithm that automatically makes good use of every level of cache. If you have the luxury to rearrange your matrices, you might try combining this with a bit-interleaved (or "Z-order") ordering of data elements.
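A sketch of the binary-decomposition idea (my own illustration; the `leaf` cutoff is an arbitrary point at which recursion stops, not a cache size — the whole point of cache-obliviousness is that no cache size appears in the code):

```python
import numpy as np

def recursive_matmul(A, B, leaf=32):
    """Cache-oblivious multiply: recursively halve the largest
    dimension. The recursion eventually produces subproblems that
    fit in every cache level, without knowing any cache size."""
    n, k = A.shape
    m = B.shape[1]
    if max(n, k, m) <= leaf:
        return A @ B
    if n >= k and n >= m:              # split the rows of A
        h = n // 2
        return np.vstack([recursive_matmul(A[:h], B, leaf),
                          recursive_matmul(A[h:], B, leaf)])
    if m >= k:                         # split the columns of B
        h = m // 2
        return np.hstack([recursive_matmul(A, B[:, :h], leaf),
                          recursive_matmul(A, B[:, h:], leaf)])
    h = k // 2                         # split the shared dimension
    return (recursive_matmul(A[:, :h], B[:h], leaf) +
            recursive_matmul(A[:, h:], B[h:], leaf))
```

The order of operations is the only thing that changes relative to the naive algorithm, which is what makes it friendly to every level of the cache hierarchy at once.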

Finally, remember: premature optimization is the root of all evil. And when it's not premature any more, always profile & benchmark before, during, and after optimizing....

comingstorm
  • 25,557
  • 3
  • 43
  • 67
1

There is no "best algorithm" for all matrices on all modern CPUs.

You will need to do some research into the many methods available, and then find a best-fit solution to the particular problems you are calculating on the particular hardware you are dealing with.

For example, the "fastest" way on your hardware platform may be to use a "slow" algorithm but ask your GPU to apply it to 256 matrices in parallel. Or using a "fast" general-purpose (mxn) algorithm may produce much slower results than using an optimised 3x3 matrix multiply. If you really want it to be fast then you may want to consider getting down to the bare metal to make sure you make best use of specific CPU features like SIMD instructions, branch prediction and cache coherence, at the expense of portability.
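To illustrate the fixed-size-kernel point: a fully unrolled 3x3 multiply (a hypothetical sketch, with matrices as row-major 9-tuples) has no loops, no bounds checks, and a straight-line body a compiler can vectorize:

```python
def matmul3x3(a, b):
    """Fully unrolled 3x3 multiply over row-major 9-tuples.
    A fixed-size kernel like this is what an optimised 3x3
    routine boils down to before SIMD is applied."""
    return (
        a[0]*b[0] + a[1]*b[3] + a[2]*b[6],
        a[0]*b[1] + a[1]*b[4] + a[2]*b[7],
        a[0]*b[2] + a[1]*b[5] + a[2]*b[8],
        a[3]*b[0] + a[4]*b[3] + a[5]*b[6],
        a[3]*b[1] + a[4]*b[4] + a[5]*b[7],
        a[3]*b[2] + a[4]*b[5] + a[5]*b[8],
        a[6]*b[0] + a[7]*b[3] + a[8]*b[6],
        a[6]*b[1] + a[7]*b[4] + a[8]*b[7],
        a[6]*b[2] + a[7]*b[5] + a[8]*b[8],
    )
```

A general (m x n) routine has to pay for loop control and dispatch on every call, which is exactly the overhead a specialized kernel avoids.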

Jason Williams
  • 56,972
  • 11
  • 108
  • 137
0

There is an algorithm called Cannon's algorithm, a distributed matrix multiplication algorithm. More here
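Cannon's algorithm distributes blocks over a p x p processor grid; after an initial skew, each processor repeatedly multiplies its local blocks while A-blocks shift left and B-blocks shift up. Here is a single-process simulation of that schedule (real implementations would do the shifts with MPI messages between ranks):

```python
import numpy as np

def cannon_matmul(A, B, p):
    """Simulate Cannon's algorithm on a p x p grid of blocks.
    Block (i, j) plays the role of processor (i, j)'s local data."""
    n = A.shape[0]
    bs = n // p  # assumes n is divisible by p
    Ab = [[A[i*bs:(i+1)*bs, j*bs:(j+1)*bs] for j in range(p)] for i in range(p)]
    Bb = [[B[i*bs:(i+1)*bs, j*bs:(j+1)*bs] for j in range(p)] for i in range(p)]
    # Initial skew: row i of A shifts left by i, column j of B shifts up by j,
    # so processor (i, j) starts holding A[i, i+j] and B[i+j, j].
    Ab = [[Ab[i][(j + i) % p] for j in range(p)] for i in range(p)]
    Bb = [[Bb[(i + j) % p][j] for j in range(p)] for i in range(p)]
    C = [[np.zeros((bs, bs)) for _ in range(p)] for _ in range(p)]
    for _ in range(p):
        for i in range(p):
            for j in range(p):
                C[i][j] += Ab[i][j] @ Bb[i][j]  # local multiply-accumulate
        # Shift A-blocks one step left and B-blocks one step up
        Ab = [[Ab[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        Bb = [[Bb[(i + 1) % p][j] for j in range(p)] for i in range(p)]
    return np.block(C)
```

After p rounds, processor (i, j) has accumulated every term A[i, k] B[k, j], and each block of A and B visits each processor exactly once, which is what makes the communication pattern efficient.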

cristian
  • 8,676
  • 3
  • 38
  • 44