How to run PCA with dask_ml. I am getting an error, "This function (tsqr) supports QR decomposition in the case of tall-and-skinny matrices"?

Question

I want to perform dimensionality reduction over data with around 3000 rows and 6000 columns. Here the number of observations (n_samples) < number of features (n_columns). I am not able to achieve the result using dask-ml whereas the same is possible through scikit learn. What modifications do I need to perform to my existing code?

#### dask_ml
from dask_ml.decomposition import PCA
from dask_ml import preprocessing
import dask.array as da
import numpy as np

train = np.random.rand(3000,6000)
train = da.from_array(train,chunks=(100,100))
complete_pca = PCA().fit(train)

#### scikit learn
from sklearn.decomposition import PCA
from sklearn import preprocessing
import numpy as np

train = np.random.rand(3000,6000)
complete_pca = PCA().fit(train)

score 0 · Accepted Answer · answered Feb 20 '19 at 00:30

0

The PCA algorithm in Dask-ML is only designed for tall-and-skinny matrices. You could try using the raw SVD algorithms in dask.array. Also, with a 3000x6000 matrix you can probably also just use a single machine.

Adding in something like Dask-ML for a problem of this size might be adding more complexity than you need. If Scikit-Learn works for you then I would stick with that.

answered Feb 20 '19 at 00:30

MRocklin

55,641
23
163
235

Thanks for your suggestion. 3000x6000 was just a sample. My data set is something like this 1000000x1300000. Will try to implement SVD on it. – Shaik Mohammad Abdullah A Feb 21 '19 at 05:50

How to run PCA with dask_ml. I am getting an error, "This function (tsqr) supports QR decomposition in the case of tall-and-skinny matrices"?

1 Answers1