
I'm using sklearn.decomposition.PCA to pre-process some training data for a machine learning model. There are 247 data points with 4095 dimensions, imported from a CSV file using pandas. I then scale the data:

training_data = StandardScaler().fit_transform(training[:,1:4096])

before calling the PCA algorithm to obtain the variance along each dimension:

pca = PCA(n_components)

pca.fit(training_data)

The output is a vector of length 247, but it should have length 4095 so that I can work out the variance of each dimension, not the variance of each data point.

My code looks like:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

test = np.array(pd.read_csv("testing.csv", sep=','))
training = np.array(pd.read_csv("training.csv", sep=','))
# ID Number = [0]
# features = [1:4096]

training_data = StandardScaler().fit_transform(training[:,1:4096])
test_data = StandardScaler().fit_transform(test[:,1:4096])
training_labels = training[:,4609]

pca = PCA()
pca.fit(training_data)
pca_variance = pca.explained_variance_

I have tried taking the transpose of training_data, but this didn't change the output. I have also tried passing different values of n_components to PCA, but it insists there can only be 247 dimensions.

This may be a stupid question, but I'm very new to this sort of data processing. Thank you.

Dan Pollard

1 Answer


You said:

" but it should have length 4095 so that I can work out the variance of each dimension, not the variance of each data point."

No. This would only be true if you estimated 4095 components using pca = PCA(n_components=4095).


On the other hand, you define:

pca = PCA() # this is actually PCA(n_components=None)

so n_components is set to None.


When this happens we have (see the documentation here):

n_components == min(n_samples, n_features)

Thus, in your case, you have min(247, 4095) = 247 components.

So, pca.explained_variance_ will be a vector of length 247, since you have 247 principal component dimensions.
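You can verify this behaviour with random data of the same shape as yours (the shapes below are illustrative, not your actual data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((247, 4095))  # 247 samples, 4095 features

pca = PCA()  # n_components=None -> min(n_samples, n_features) components
pca.fit(X)

print(pca.explained_variance_.shape)  # (247,) -- one entry per component
```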


Why do we have n_components == min(n_samples, n_features) ?

This is related to the rank of the covariance/correlation matrix. Given a data matrix X with shape [247, 4095], the covariance/correlation matrix would be [4095, 4095], with maximum rank min(n_samples, n_features) (in fact at most n_samples - 1 after mean-centering). Thus, you have at most min(n_samples, n_features) meaningful PC components/dimensions.
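A small numerical sketch of the rank argument, using arbitrary shapes (n=10 samples, d=50 features) rather than your data:

```python
import numpy as np

n, d = 10, 50
X = np.random.default_rng(1).standard_normal((n, d))

Xc = X - X.mean(axis=0)        # mean-center, as PCA does internally
cov = (Xc.T @ Xc) / (n - 1)    # [d, d] sample covariance matrix

# Rank is at most min(n, d); centering costs one degree of freedom,
# so for generic data it comes out to n - 1.
print(np.linalg.matrix_rank(cov))
```

So even though the covariance matrix is 50x50 here (4095x4095 in your case), only n - 1 of its eigenvalues can be non-zero, which is why PCA caps the number of components.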

seralouk