
I'm using sklearn.decomposition.PCA to pre-process some training data for a machine learning model. There are 247 data points with 4095 dimensions, imported from a CSV file using pandas. I then scale the data:

training_data = StandardScaler().fit_transform(training[:,1:4096])

before calling the PCA algorithm to obtain the variance along each dimension:

pca = PCA(n_components)

pca.fit(training_data)

The output is a vector of length 247, but it should have length 4095 so that I can work out the variance of each dimension, not the variance of each data point.

My code looks like:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

test = np.array(pd.read_csv("testing.csv", sep=','))
training = np.array(pd.read_csv("training.csv", sep=','))
# ID Number = [0]
# features = [1:4096]

training_data = StandardScaler().fit_transform(training[:,1:4096])
test_data = StandardScaler().fit_transform(test[:,1:4096])
training_labels = training[:,4609]

pca = PCA()
pca.fit(training_data)
pca_variance = pca.explained_variance_

I have tried taking the transpose of training_data, but this didn't change the output. I have also tried passing different values of n_components to PCA, but it insists there can only be 247 dimensions.

This may be a stupid question, but I'm very new to this sort of data processing. Thank you.

Dan Pollard

1 Answer


You said:

" but it should have length 4095 so that I can work out the variance of each dimension, not the variance of each data point."

No. This would only be true if you estimated 4095 components using pca = PCA(n_components=4095).


On the other hand, you define:

pca = PCA() # this is actually PCA(n_components=None)

so n_components is set to None.


When this happens we have (see the documentation here):

n_components == min(n_samples, n_features)

Thus, in your case, you have min(247, 4095) = 247 components.

So, pca.explained_variance_ will be a vector of length 247, since you have 247 principal component dimensions.
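You can verify this behaviour with random data of the same shape as yours (the shapes below are illustrative, not your actual data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((247, 4095))  # 247 samples, 4095 features

pca = PCA()  # n_components=None -> min(n_samples, n_features) components
pca.fit(X)

print(pca.explained_variance_.shape)  # (247,) -- one entry per component
```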


Why do we have n_components == min(n_samples, n_features) ?

This is related to the rank of the covariance/correlation matrix. Given a data matrix X with shape [247, 4095], the covariance/correlation matrix would be [4095, 4095], with maximum rank min(n_samples, n_features) (in fact at most n_samples - 1 after mean-centering). Thus, you have at most min(n_samples, n_features) meaningful PC components/dimensions.
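A small numerical sketch of the rank argument, using arbitrary shapes (n=10 samples, d=50 features) rather than your data:

```python
import numpy as np

n, d = 10, 50
X = np.random.default_rng(1).standard_normal((n, d))

Xc = X - X.mean(axis=0)        # mean-center, as PCA does internally
cov = (Xc.T @ Xc) / (n - 1)    # [d, d] sample covariance matrix

# Rank is at most min(n, d); centering costs one degree of freedom,
# so for generic data it comes out to n - 1.
print(np.linalg.matrix_rank(cov))
```

So even though the covariance matrix is 50x50 here (4095x4095 in your case), only n - 1 of its eigenvalues can be non-zero, which is why PCA caps the number of components.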

seralouk