14

What is the best matrix multiplication algorithm? What does 'the best' mean for me? It means the fastest, ready for today's machines.

Please give links to pseudocode if you can.

guest
  • 1,696
  • 4
  • 20
  • 31

8 Answers

14

BLAS is the best ready-to-use efficient matrix multiplication library. There are many different implementations. Here is a benchmark I made of some implementations on a MacBook Pro with a dual-core Intel Core 2 Duo at 2.66 GHz:

[benchmark plot: matrix-multiply performance of several BLAS implementations]

There are also other commercial implementations that I didn't test here.

Antonin Portelli
  • 688
  • 5
  • 12
9

The best matrix multiplication algorithm is the one that someone with detailed architectural knowledge has already hand-tuned for your target platform.

There are lots of good libraries that supply tuned matrix-multiply implementations. Use one of them.

Stephen Canon
  • 103,815
  • 19
  • 183
  • 269
8

There are probably better ones, but these are the ones I've heard of (better than the standard cubic-complexity algorithm).

Strassen's - O(N^2.8)

Coppersmith–Winograd - O(N^2.376)
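To make Strassen's concrete, here is an illustrative sketch in Python/numpy (my own, not from a library). It assumes square matrices whose size is a power of two, and the `leaf` cutoff below which it falls back to the naive product is an arbitrary tuning parameter:

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Strassen's algorithm: 7 recursive multiplies instead of 8,
    giving O(N^2.8). Assumes square, power-of-two-sized inputs."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B  # fall back to the naive product on small blocks
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The seven Strassen products
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

In practice the extra additions mean Strassen's only wins for fairly large matrices, which is why the leaf cutoff matters.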

Sonny Saluja
  • 7,193
  • 2
  • 25
  • 39
  • 3
    These have a better asymptotic complexity than the "standard" O(N^3) algorithm, but the constant (at least for Coppersmith–Winograd) is prohibitively large for moderate-sized matrices. See this post on mathoverflow: http://mathoverflow.net/questions/1743/what-is-the-constant-of-the-coppersmith-winograd-matrix-multiplication-algorithm – celion Dec 16 '10 at 00:10
  • 1
    Excellent point. That makes me wonder if the matrix libraries mentioned by Jim adapt the algorithm based on the input size. – Sonny Saluja Dec 16 '10 at 00:29
  • As I understand it, you can break the multiplication into blocks, such that all the data you're working on fits in cache, so that gives you a good constant speedup. ATLAS actually benchmarks itself to tune its parameters. – celion Dec 16 '10 at 09:07
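The blocking idea from the comments can be sketched as follows (a minimal illustration, not how ATLAS actually implements it; the block size `bs` stands in for the parameter ATLAS would tune by benchmarking):

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """Naive O(N^3) product reorganized into bs x bs tiles, so the
    tiles of A, B and C being worked on stay resident in cache."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=np.result_type(A, B))
    for i0 in range(0, n, bs):
        for k0 in range(0, k, bs):
            for j0 in range(0, m, bs):
                # numpy slicing clips at the edges, so ragged
                # trailing blocks are handled automatically
                C[i0:i0+bs, j0:j0+bs] += (
                    A[i0:i0+bs, k0:k0+bs] @ B[k0:k0+bs, j0:j0+bs]
                )
    return C
```

Same arithmetic and same asymptotic cost as the textbook triple loop; only the traversal order changes, which is exactly where the constant-factor speedup comes from.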
6

Why pseudocode? Why implement it yourself? If speed is your concern, there are highly optimized implementations available that include optimizations for specific instruction sets (e.g. SIMD); implementing all of that yourself offers no real benefit (apart from maybe learning).

Take a look at different BLAS implementations, like:

http://www.netlib.org/blas/

http://math-atlas.sourceforge.net/
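As a practical note: if you're in Python, numpy already delegates dense products to whatever BLAS it was built against (ATLAS, OpenBLAS, MKL, Accelerate, ...), so "use BLAS" often just means "use your library's matmul and don't hand-roll it":

```python
import numpy as np

# numpy's @ operator dispatches dense float matrix products to the
# linked BLAS's gemm routine, so this line is already a tuned,
# SIMD-friendly multiply on most installs.
A = np.random.rand(512, 512)
B = np.random.rand(512, 512)
C = A @ B
```

(You can inspect which BLAS your numpy build links against via `np.__config__.show()`.)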

Jim Brissom
  • 31,821
  • 4
  • 39
  • 33
3

Depends on the size of the matrix, and whether it's sparse or not.

For small-to-medium-sized dense matrices, I believe that some variation on the "naive" O(N^3) algorithm is a win, if you pay attention to cache-coherence and use the platform's vector instructions.

Data arrangement is important -- for cases where your standard matrix layout is cache-unfriendly (e.g., column-major * row-major), you should try binary decomposition of your matrix multiplication -- even if you don't use Strassen's or other "fast" algorithms, this order of operations can yield a "cache-oblivious" algorithm that automatically makes good use of every level of cache. If you have the luxury to rearrange your matrices, you might try combining this with a bit-interleaved (or "Z-order") ordering of data elements.
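A sketch of the binary-decomposition idea (my own illustration; the `leaf` cutoff is an arbitrary point at which recursion stops, not a cache size — the whole point of cache-obliviousness is that no cache size appears in the code):

```python
import numpy as np

def recursive_matmul(A, B, leaf=32):
    """Cache-oblivious multiply: recursively halve the largest
    dimension. The recursion eventually produces subproblems that
    fit in every cache level, without knowing any cache size."""
    n, k = A.shape
    m = B.shape[1]
    if max(n, k, m) <= leaf:
        return A @ B
    if n >= k and n >= m:              # split the rows of A
        h = n // 2
        return np.vstack([recursive_matmul(A[:h], B, leaf),
                          recursive_matmul(A[h:], B, leaf)])
    if m >= k:                         # split the columns of B
        h = m // 2
        return np.hstack([recursive_matmul(A, B[:, :h], leaf),
                          recursive_matmul(A, B[:, h:], leaf)])
    h = k // 2                         # split the shared dimension
    return (recursive_matmul(A[:, :h], B[:h], leaf) +
            recursive_matmul(A[:, h:], B[h:], leaf))
```

The order of operations is the only thing that changes relative to the naive algorithm, which is what makes it friendly to every level of the cache hierarchy at once.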

Finally, remember: premature optimization is the root of all evil. And when it's not premature any more, always profile & benchmark before, during, and after optimizing....

comingstorm
  • 25,557
  • 3
  • 43
  • 67
1

There is no "best algorithm" for all matrices on all modern CPUs.

You will need to do some research into the many methods available, and then find a best-fit solution to the particular problems you are calculating on the particular hardware you are dealing with.

For example, the "fastest" way on your hardware platform may be to use a "slow" algorithm but ask your GPU to apply it to 256 matrices in parallel. Or using a "fast" general-purpose (mxn) algorithm may produce much slower results than using an optimised 3x3 matrix multiply. If you really want it to be fast then you may want to consider getting down to the bare metal to make sure you make best use of specific CPU features like SIMD instructions, branch prediction and cache coherence, at the expense of portability.
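To illustrate the fixed-size-kernel point: a fully unrolled 3x3 multiply (a hypothetical sketch, with matrices as row-major 9-tuples) has no loops, no bounds checks, and a straight-line body a compiler can vectorize:

```python
def matmul3x3(a, b):
    """Fully unrolled 3x3 multiply over row-major 9-tuples.
    A fixed-size kernel like this is what an optimised 3x3
    routine boils down to before SIMD is applied."""
    return (
        a[0]*b[0] + a[1]*b[3] + a[2]*b[6],
        a[0]*b[1] + a[1]*b[4] + a[2]*b[7],
        a[0]*b[2] + a[1]*b[5] + a[2]*b[8],
        a[3]*b[0] + a[4]*b[3] + a[5]*b[6],
        a[3]*b[1] + a[4]*b[4] + a[5]*b[7],
        a[3]*b[2] + a[4]*b[5] + a[5]*b[8],
        a[6]*b[0] + a[7]*b[3] + a[8]*b[6],
        a[6]*b[1] + a[7]*b[4] + a[8]*b[7],
        a[6]*b[2] + a[7]*b[5] + a[8]*b[8],
    )
```

A general (m x n) routine has to pay for loop control and dispatch on every call, which is exactly the overhead a specialized kernel avoids.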

Jason Williams
  • 56,972
  • 11
  • 108
  • 137
0

There is an algorithm called Cannon's algorithm, a distributed matrix multiplication algorithm. More here
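Cannon's algorithm distributes blocks over a p x p processor grid; after an initial skew, each processor repeatedly multiplies its local blocks while A-blocks shift left and B-blocks shift up. Here is a single-process simulation of that schedule (real implementations would do the shifts with MPI messages between ranks):

```python
import numpy as np

def cannon_matmul(A, B, p):
    """Simulate Cannon's algorithm on a p x p grid of blocks.
    Block (i, j) plays the role of processor (i, j)'s local data."""
    n = A.shape[0]
    bs = n // p  # assumes n is divisible by p
    Ab = [[A[i*bs:(i+1)*bs, j*bs:(j+1)*bs] for j in range(p)] for i in range(p)]
    Bb = [[B[i*bs:(i+1)*bs, j*bs:(j+1)*bs] for j in range(p)] for i in range(p)]
    # Initial skew: row i of A shifts left by i, column j of B shifts up by j,
    # so processor (i, j) starts holding A[i, i+j] and B[i+j, j].
    Ab = [[Ab[i][(j + i) % p] for j in range(p)] for i in range(p)]
    Bb = [[Bb[(i + j) % p][j] for j in range(p)] for i in range(p)]
    C = [[np.zeros((bs, bs)) for _ in range(p)] for _ in range(p)]
    for _ in range(p):
        for i in range(p):
            for j in range(p):
                C[i][j] += Ab[i][j] @ Bb[i][j]  # local multiply-accumulate
        # Shift A-blocks one step left and B-blocks one step up
        Ab = [[Ab[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        Bb = [[Bb[(i + 1) % p][j] for j in range(p)] for i in range(p)]
    return np.block(C)
```

After p rounds, processor (i, j) has accumulated every term A[i, k] B[k, j], and each block of A and B visits each processor exactly once, which is what makes the communication pattern efficient.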

cristian
  • 8,676
  • 3
  • 38
  • 44