This may not be the answer you're looking for, but one way to speed it up considerably is to use a GPU instead of your CPU. If you have a reasonably powerful graphics card around, it will outperform your CPU any day, even if your system is very well tuned.
For nice integration with NumPy, you could use Theano (if your graphics card is made by NVIDIA). The calculation in the following code runs for me in a couple of seconds (although I have a very powerful graphics card):
$ THEANO_FLAGS=device=gpu0 python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import theano
Using gpu device 0: GeForce GTX 480
>>> from theano import tensor as T
>>> import numpy
>>> x = numpy.ones((200000, 1000), dtype=numpy.float32)
>>> m = T.matrix()
>>> mTm = T.dot(m.T, m)
>>> f = theano.function([m], mTm)
>>> f(x)
array([[ 200000., 200000., 200000., ..., 200000., 200000., 200000.],
[ 200000., 200000., 200000., ..., 200000., 200000., 200000.],
[ 200000., 200000., 200000., ..., 200000., 200000., 200000.],
...,
[ 200000., 200000., 200000., ..., 200000., 200000., 200000.],
[ 200000., 200000., 200000., ..., 200000., 200000., 200000.],
[ 200000., 200000., 200000., ..., 200000., 200000., 200000.]], dtype=float32)
>>> r = f(x)
>>> r.shape
(1000, 1000)
I was going to find out how long

>>> numpy.dot(x.T, x)

took on the CPU by way of comparison, but I got bored waiting...
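If you want a rough CPU baseline without the wait, one option is to time the same product on a scaled-down matrix (the 2000-row size here is just an illustrative choice, not from the session above):

```python
import time
import numpy

# Scaled-down version of the matrix above: 2000 x 1000 instead of 200000 x 1000
x = numpy.ones((2000, 1000), dtype=numpy.float32)

start = time.time()
r = numpy.dot(x.T, x)       # (1000 x 2000) . (2000 x 1000) -> (1000 x 1000)
elapsed = time.time() - start

print(r.shape)              # (1000, 1000)
print(r[0][0])              # 2000.0 -- each entry is a sum of 2000 ones
print(elapsed)              # wall-clock seconds for the CPU dot product
```

Since the matrix product scales linearly in the number of rows, multiplying the measured time by 100 gives a crude estimate for the full 200000-row case.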
You can also try PyCUDA, or PyOpenCL if you don't have an NVIDIA graphics card, although I don't know whether their NumPy support is as straightforward.