I have a large NxN dense symmetric matrix and want the eigenvectors corresponding to the k largest eigenvalues. What's the best way to find them (preferably using numpy, but perhaps in general using blas/atlas/lapack if that's the only way to go)? In general N is much larger than k (say N > 5000, k < 10).

Numpy seems to only have functions for finding the k largest eigenvalues if my starting matrix is sparse.

Anthony Bak

2 Answers


In SciPy, you can use the `linalg.eigh` function with the `eigvals` parameter (renamed `subset_by_index` in newer SciPy releases):

eigvals : tuple (lo, hi) Indexes of the smallest and largest (in ascending order) eigenvalues and corresponding eigenvectors to be returned: 0 <= lo < hi <= M-1. If omitted, all eigenvalues and eigenvectors are returned.

In your case it should be set to `(N-k, N-1)`.
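A minimal sketch of that call, using `subset_by_index` (the current name of the `eigvals` parameter in recent SciPy releases); the matrix and sizes here are just illustrative:

```python
import numpy as np
from scipy.linalg import eigh

N, k = 1000, 5
rng = np.random.default_rng(0)
A = rng.standard_normal((N, N))
A = A + A.T  # symmetrize, since eigh assumes a symmetric matrix

# Request only the k largest eigenpairs by index (ascending order,
# so the largest eigenvalues occupy indices N-k .. N-1).
evals, evecs = eigh(A, subset_by_index=(N - k, N - 1))

print(evals.shape)  # (k,)
print(evecs.shape)  # (N, k)
```

`eigh` returns the selected eigenvalues in ascending order, with the eigenvectors in the corresponding columns of `evecs`.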

Andy Hayden
  • The non-sparse method was the fastest for me. Using Giuliano's benchmark script with k=2 I get: eigh elapsed time: 93.704689; eigsh elapsed time: 353.433379; eig elapsed time: 870.060089 (the last is `numpy.linalg.eig`). This is on my MacBook Pro. – Anthony Bak Aug 30 '12 at 17:06

Actually, the sparse routines also work for dense numpy arrays. I think they use some kind of Krylov-subspace iteration, so they need to compute several matrix-vector products, which means that if your k << N, the sparse routines could be (marginally?) faster.

Check out the docs http://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html

and the following code (go have a good coffee with friends while it runs):

import numpy as np
from time import perf_counter
from scipy.linalg import eigh as largest_eigh
from scipy.sparse.linalg import eigsh as largest_eigsh

np.set_printoptions(suppress=True)
np.random.seed(0)
N = 5000
k = 10
X = np.random.random((N, N)) - 0.5
X = np.dot(X, X.T)  # create a symmetric matrix

# Benchmark the dense routine
# (in newer SciPy the `eigvals` keyword is named `subset_by_index`)
start = perf_counter()
evals_large, evecs_large = largest_eigh(X, subset_by_index=(N - k, N - 1))
elapsed = perf_counter() - start
print("eigh elapsed time:", elapsed)

# Benchmark the sparse routine
start = perf_counter()
evals_large_sparse, evecs_large_sparse = largest_eigsh(X, k, which='LM')
elapsed = perf_counter() - start
print("eigsh elapsed time:", elapsed)
Giuliano
  • With several examples of "small" `k` I've tried, I get 44 seconds vs 18 seconds (`eigsh` being the faster); when `k=2` they are approximately the same; when `k=1` (strangely) or when `k` is "large", `eigsh` is considerably slower; in all cases `eigh` takes around 44 seconds. There must be a more efficient algorithm for this; you would expect to be able to find the largest eigenvalue pair in orders of magnitude less time than finding all of them... – Andy Hayden Aug 29 '12 at 11:08
  • 1
    That is why I said 'could'... i don't really know! AFAIK most methods for determination of dominant eigenvalues are evolutions of the good old power iteration method, meaning that they need to do A*x many times, and with N=5000 and A full this may be not ideal. In any case OP was asking for what was available in numpy/scipy, and it turns out that he can choose between 2 methods. – Giuliano Aug 29 '12 at 11:46
  • And I think the answer is: sparse routine *is* faster in the OP's case! :) – Andy Hayden Aug 29 '12 at 12:01
  • With such a small margin it's hard to say; other considerations may play a role, e.g. the specific BLAS/LAPACK implementation available on the machine. In my case, with k=10, I get the same ratio between CPU times as you (on a 2008 MacBook Pro), but it is four times slower. If the OP has a blazing-fast BLAS/LAPACK, he may not even notice the difference... – Giuliano Aug 29 '12 at 12:24
  • On my laptop it turns out the non-sparse method was fastest. – Anthony Bak Aug 30 '12 at 17:11
  • BLAS / LAPACK can be twice as fast with float32 as float64, but "Single-precision types in (arpack) `eigs` and `eighs` are not supported currently." -- Mac OSX. – denis Sep 03 '12 at 09:19
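For context, the power-iteration idea referenced in the comments above can be sketched in a few lines (illustrative only; ARPACK's Lanczos iteration used by `eigsh` is considerably more refined):

```python
import numpy as np

def power_iteration(A, num_iters=500, seed=0):
    """Estimate the dominant eigenpair of a symmetric matrix A by
    repeated matrix-vector products -- the basic mechanism behind
    Krylov-subspace solvers such as ARPACK's eigsh."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = A @ v          # one dense matrix-vector product per step
        v = w / np.linalg.norm(w)
    # Rayleigh quotient gives the eigenvalue estimate
    return v @ A @ v, v

# Example: the dominant eigenvalue of a small diagonal matrix
val, vec = power_iteration(np.diag([3.0, 1.0, 0.5]))
```

Each step costs one full matrix-vector product, which is why for a dense N=5000 matrix the dense `eigh` path can still win despite computing the whole spectrum.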