14

I am trying to reduce the dimensionality of a very large matrix using PCA in scikit-learn, but it produces a memory error (the RAM required exceeds 128 GB). I have already set copy=False and I'm using the less computationally expensive randomised PCA.

Is there a workaround? If not, what other dimensionality reduction techniques could I use that require less memory? Thank you.


Update: the matrix I am trying to reduce with PCA is a set of feature vectors, obtained by passing a set of training images through a pretrained CNN. The matrix is [300000, 51200]. PCA components tried: 100 to 500.

I want to reduce its dimensionality so I can use these features to train an ML algo, such as XGBoost. Thank you.
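For reference, a minimal sketch of the kind of call that fails (placeholder array; n_components is one value from the 100 to 500 range I tried):

import numpy as np
from sklearn.decomposition import PCA

# Small placeholder; at the real size of [300000, 51200] this call runs out of RAM
# even with copy=False and the randomized solver.
train_features = np.random.rand(3000, 512).astype(np.float32)

pca = PCA(n_components=250, copy=False, svd_solver='randomized')
train_features_reduced = pca.fit_transform(train_features)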

Chris Parry
  • This makes me think of the [X Y problem](https://meta.stackexchange.com/a/66378/311624). Why do you want to reduce the dimensionality? What are you trying to achieve with that matrix? Is it a dense matrix? – iled Apr 11 '17 at 23:02
  • Please provide more information: how many components are you using, what's your input dataset size etc. – rth Apr 11 '17 at 23:03
  • You could try some type of feature reduction technique to remove any redundant/uninformative features from the set. – semore_1267 Apr 12 '17 at 01:08
  • What feature reduction technique would you suggest? Thank you. – Chris Parry Apr 12 '17 at 01:33
  • I am experiencing the same problem with KernelPCA reduction. How can it be solved in a non-linear way? – Sultan1991 Feb 22 '20 at 12:56

3 Answers

9

In the end, I used TruncatedSVD instead of PCA, which is capable of handling large matrices without memory issues:

from sklearn import decomposition

n_comp = 250

# ARPACK-based truncated SVD; unlike PCA, it does not center the data.
svd = decomposition.TruncatedSVD(n_components=n_comp, algorithm='arpack')
svd.fit(train_features)

# Fraction of the variance retained by the 250 components
print(svd.explained_variance_ratio_.sum())

# Project both splits into the reduced space
train_features = svd.transform(train_features)
test_features = svd.transform(test_features)
Chris Parry
  • Just adding this as a side note, but how were you able to still compute the correct results? AFAIK, PCA centers the data, which you would have to do manually for TruncatedSVD (see the sketch below). – dennlinger Jun 12 '18 at 08:36
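Regarding the centering point above: a minimal sketch of centering by hand before TruncatedSVD, using dense placeholder arrays (subtracting the training mean in place avoids an extra full-size copy):

import numpy as np
from sklearn.decomposition import TruncatedSVD

# Dense placeholder arrays standing in for the real feature matrices.
train_features = np.random.rand(1000, 512).astype(np.float32)
test_features = np.random.rand(200, 512).astype(np.float32)

# Subtract the training mean in place so no extra full-size copy is made.
mean = train_features.mean(axis=0)
train_features -= mean
test_features -= mean

svd = TruncatedSVD(n_components=100, algorithm='arpack')
train_reduced = svd.fit_transform(train_features)
test_reduced = svd.transform(test_features)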
5

You could use IncrementalPCA, available in scikit-learn: from sklearn.decomposition import IncrementalPCA. The rest of the interface is the same as PCA's. You need to pass an extra argument, batch_size, which must be at least as large as the number of components.
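A minimal sketch of that pattern, with a placeholder array and illustrative sizes:

import numpy as np
from sklearn.decomposition import IncrementalPCA

# Placeholder for the CNN feature matrix; in practice the batches would be
# read from disk (e.g. a memory-mapped .npy file) rather than held in RAM.
features = np.random.rand(5000, 512).astype(np.float32)

n_components = 100
ipca = IncrementalPCA(n_components=n_components, batch_size=500)

# Each partial_fit call only needs one batch (500 rows >= 100 components) in memory.
for batch in np.array_split(features, 10):
    ipca.partial_fit(batch)

features_reduced = ipca.transform(features)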

However, if you need to apply a non-linear version like KernelPCA, there does not seem to be support for anything similar. KernelPCA's memory requirement absolutely explodes; see the Wikipedia article on Non-linear Dimensionality Reduction.
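One possible workaround (not an incremental KernelPCA, just an approximation) is to build an explicit approximate kernel feature map with Nystroem and then run ordinary linear PCA on it. A minimal sketch, with placeholder data and an assumed RBF kernel:

import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.decomposition import PCA

# Placeholder data; kernel choice and sizes are illustrative only.
X = np.random.rand(5000, 512).astype(np.float32)

# Approximate the RBF kernel feature map with 1000 landmark points,
# then run ordinary linear PCA in that approximate feature space.
feature_map = Nystroem(kernel='rbf', n_components=1000, random_state=0)
X_mapped = feature_map.fit_transform(X)

pca = PCA(n_components=100, svd_solver='randomized')
X_reduced = pca.fit_transform(X_mapped)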

Vivek Puurkayastha
  • Has anybody come up with a way to run ```KernelPCA``` or similar nonlinear PCA? It has crazy RAM requirements even on small matrices... – Moysey Abramowitz Feb 13 '22 at 14:47
0
import numpy as np
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
mnist.target = mnist.target.astype(np.uint8)

# Split data into training and test
X, y = mnist["data"], mnist["target"]
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
del mnist

# Use Incremental PCA to avoid MemoryError: Unable to allocate array with shape
from sklearn.decomposition import IncrementalPCA
m, n = X_train.shape
n_batches = 100
n_components = 154

ipca = IncrementalPCA(
    copy=False,
    n_components=n_components,
    batch_size=(m // n_batches)
)
X_train_reduced_ipca = ipca.fit_transform(X_train)
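The same fitted transformer can also be applied to the held-out split; a one-line follow-up, assuming the X_test from the split above:

X_test_reduced_ipca = ipca.transform(X_test)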
mon