5

I have a very large matrix (about 500,000 × 20,000) containing the data I want to analyze with PCA. To do this I'm using the ParallelColt library, with both singular value decomposition and eigenvalue decomposition to get the eigenvectors and eigenvalues of the covariance matrix. But these methods exhaust the heap and I get "OutOfMemory" errors...

Even using SparseDoubleMatrix2D (the data are very sparse) the errors remain, so I ask you: how can I solve this problem?

Should I change libraries?

dacanalr

3 Answers

2

You can compute the PCA with Oja's rule: it's an iterative algorithm that improves an estimate of the PCA one vector at a time. It's slower than the usual PCA, but it requires you to store only one vector in memory. It's also numerically very stable.

http://en.wikipedia.org/wiki/Oja%27s_rule
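Below is a minimal sketch of that idea for the first principal component, in plain Java (the class and method names are just for illustration, not from any library). It assumes you can stream the rows of the matrix, already mean-centered, from disk one at a time; only a single 20,000-element weight vector is kept in memory.

```java
import java.util.Random;

/** Minimal sketch of Oja's rule for the first principal component.
 *  Only one d-dimensional weight vector is held in memory; the data
 *  rows are streamed (the Iterable must be re-iterable, e.g. by
 *  re-opening the file each epoch). Rows are assumed mean-centered. */
public class OjaPca {

    public static double[] firstComponent(Iterable<double[]> rows, int d,
                                          double learningRate, int epochs) {
        Random rnd = new Random(42);
        double[] w = new double[d];
        for (int i = 0; i < d; i++) w[i] = rnd.nextGaussian();
        normalize(w);

        for (int epoch = 0; epoch < epochs; epoch++) {
            for (double[] x : rows) {
                // y = w . x  (projection of this sample onto the current estimate)
                double y = 0.0;
                for (int i = 0; i < d; i++) y += w[i] * x[i];
                // Oja update: w += eta * y * (x - y * w)
                for (int i = 0; i < d; i++) {
                    w[i] += learningRate * y * (x[i] - y * w[i]);
                }
            }
            normalize(w); // guard against numerical drift
        }
        return w;
    }

    private static void normalize(double[] v) {
        double norm = 0.0;
        for (double c : v) norm += c * c;
        norm = Math.sqrt(norm);
        for (int i = 0; i < v.length; i++) v[i] /= norm;
    }
}
```

Further components can then be obtained by deflation: once w has converged, subtract the projection (w·x)w from each row as it is read and run the same update again.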

Monkey
0

I'm not sure that changing libraries will help. You're going to need doubles (8 bytes each). I don't know what the dimension of the covariance matrix would be in this case, but switching libraries won't change the underlying calculations much.

What is the -Xmx setting when you run? What about the perm gen size? Perhaps you can increase them.
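For example, a quick way to confirm how much heap the JVM was actually granted (it reflects whatever -Xmx you passed) is a throwaway class like the one below; the class name is just for illustration, the call is standard java.lang.Runtime.

```java
/** Prints the maximum heap the running JVM will use (reflects -Xmx). */
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %.0f MB%n", maxBytes / (1024.0 * 1024.0));
    }
}
```

Running it with something like `java -Xmx4g HeapCheck` on a 64-bit JVM shows whether the setting actually took effect.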

Does the algorithm halt immediately, or does it run for a while? If it's the latter, you can attach to the process using VisualVM 1.3.3 (download and install all the plugins). It'll let you see what's happening on the heap, threads, etc., and could help you ferret out the root cause.

A Google search for "Java eigenvalue of large matrices" turned up this library from Google. If you scroll down in the comments, I wonder whether a block Lanczos eigenvalue analysis might help. It might be enough if you can get a subset of the eigenvalues.

These SVM implementations claim to be useful for large datasets:

http://www.support-vector-machines.org/SVM_soft.html

I don't think you can ask for more than 2GB of heap from a 32-bit JVM:

http://www.theserverside.com/discussions/thread.tss?thread_id=26347

According to Oracle, you'll need a 64-bit JVM running on a 64-bit OS:

http://www.oracle.com/technetwork/java/hotspotfaq-138619.html#gc_heap_32bit

duffymo
  • dim of the result will be 500000x500000. – Roman Byshko Dec 05 '11 at 23:55
  • Sure about that? Not 20K x 20K? – duffymo Dec 05 '11 at 23:56
  • This is covariance matrix. (X is input) http://upload.wikimedia.org/wikipedia/en/math/6/7/6/67616c643a158c1e00a8e4d5ec3d0b1a.png – Roman Byshko Dec 05 '11 at 23:59
  • let's better run away....... :) I think he should do it block by block, saving intermediate results to HDD, or? – Roman Byshko Dec 06 '11 at 00:04
  • LOL - I don't know if SVD can work that way, because you have to construct the orthonormal basis. But that's a good research direction. I like that earlier iterative recommendation. Is there an SVD with iterative correction? – duffymo Dec 06 '11 at 00:07
  • The matrix (call it A) is a dataset of documents that will be used to train an SVM: in particular, there is a row for each document and a column for each single word. So the covariance matrix is transpose(A)*A. I've tried to set -Xmx3500M and higher, but this doesn't help. Initially the matrix is a SparseDouble2D matrix and it occupies very little memory (< 50 MB), but when I calculate the mean of each column and subtract it, the matrix is no longer sparse and the memory increases column by column... – dacanalr Dec 06 '11 at 00:10
  • Okay, so if it's the limiting case of sparse (diagonal), that means 8 bytes/double * 500,000 doubles = 4e6 bytes to start. Max is a full matrix = 8 bytes/double * (500,000*20,000) doubles = 8e10 bytes. The OP's situation is somewhere between those extremes. – duffymo Dec 06 '11 at 00:11
  • I don't think you can ask for more than 2GB for a JVM: http://www.theserverside.com/discussions/thread.tss?thread_id=26347 – duffymo Dec 06 '11 at 00:11
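Regarding the centering issue raised in the comments above (subtracting the column means turns the sparse matrix dense): one common workaround is to keep A sparse and apply the centering implicitly inside covariance-times-vector products, since (A − 1·muᵀ)ᵀ(A − 1·muᵀ)v = AᵀAv − n·mu·(muᵀv), so Cv = (1/n)Aᵀ(Av) − mu·(muᵀv). Such a product can then drive an iterative eigensolver (power iteration, Lanczos, or the Oja update in the other answer). The sketch below assumes Parallel Colt's tdouble API (zMult, zDotProduct); the class name is hypothetical and package names should be checked against your version.

```java
import cern.colt.matrix.tdouble.DoubleMatrix1D;
import cern.colt.matrix.tdouble.DoubleMatrix2D;
import cern.colt.matrix.tdouble.impl.DenseDoubleMatrix1D;

/** Multiplies v by the covariance of the row-centered data WITHOUT ever
 *  forming the dense centered matrix:
 *      C v = (1/n) * A^T (A v) - mu * (mu . v)
 *  where A stays sparse and mu is the vector of column means. */
public final class ImplicitCenteredCovariance {

    private final DoubleMatrix2D a;   // n x d sparse data matrix (uncentered)
    private final DoubleMatrix1D mu;  // d column means
    private final int n;

    public ImplicitCenteredCovariance(DoubleMatrix2D a, DoubleMatrix1D mu) {
        this.a = a;
        this.mu = mu;
        this.n = a.rows();
    }

    public DoubleMatrix1D multiply(DoubleMatrix1D v) {
        int d = a.columns();
        DoubleMatrix1D av = new DenseDoubleMatrix1D(n);
        a.zMult(v, av);                       // av = A v
        DoubleMatrix1D atav = new DenseDoubleMatrix1D(d);
        a.zMult(av, atav, 1.0, 0.0, true);    // atav = A^T (A v)
        double muDotV = mu.zDotProduct(v);
        DoubleMatrix1D result = new DenseDoubleMatrix1D(d);
        for (int i = 0; i < d; i++) {
            result.setQuick(i, atav.getQuick(i) / n - muDotV * mu.getQuick(i));
        }
        return result;
    }
}
```

Plugging this `multiply` into a simple power iteration (normalize v after each call) gives the leading eigenvector of the covariance while A stays sparse the whole time.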
0

I built some sparse, incremental algorithms for just this sort of problem. Conveniently, they're built on top of Colt.

See the HallMarshalMartin class in the trickl-cluster library linked below. You can feed it chunks of rows at a time, so it should solve your memory issues.

The code is available under the GPL. I'm afraid I've only just released it, so it's short on documentation; hopefully it's fairly self-explanatory. There are JUnit tests that should help with usage.

http://open.trickl.com/trickl-pca/index.html

Tim Gee