I am trying to apply PCA in Python using sklearn. I have written the following function:

def dimensionality_reduction(train_dataset_mod1, train_dataset_mod2, test_dataset_mod1, test_dataset_mod2):

  pca = PCA(n_components= 200)
  pca.fit(train_dataset_mod1.transpose())
  mod1_features_train = pca.components_
  pca2 = PCA(n_components=200)
  pca2.fit(train_dataset_mod2.transpose())
  mod2_features_train = pca2.components_
  mod1_features_test = pca.transform(test_dataset_mod1)
  mod2_features_test = pca2.transform(test_dataset_mod2)

  return mod1_features_train.transpose(), mod2_features_train.transpose(), mod1_features_test, mod2_features_test

The sizes of my matrices are as follows:

train_dataset_mod1 733x5000
test_dataset_mod1 360x5000
mod1_features_train 200x733
train_dataset_mod2 733x8000
test_dataset_mod2 360x8000
mod2_features_train 200x733

However, when I run the whole script, I get an error at the following location:

File "\Anaconda2\lib\site-packages\sklearn\decomposition\base.py", line 132, in transform
    X = X - self.mean_

What is the issue? How can I apply the pca to the test data?

Here is an example from debugging the PCA for mod1 (image not available).

The transformed datasets mod1_features_train and mod2_features_train both have the correct size, 500x733. However, I cannot do the same with test_dataset_mod1 and test_dataset_mod2. Why?

EDIT: While debugging I noticed that in the base.py file of PCA there is an operation X = X - self.mean_, where X is my test data and self.mean_ is the mean computed when fitting on the training set (self.mean_ has size 733, which does not match X). If I remove the transpose() from the training step, PCA runs without errors and the results for test_dataset_mod1 and test_dataset_mod2 have the correct size, 360x500; however, the results for train_dataset_mod1 and train_dataset_mod2 have the wrong size, 5000x500. Why?
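
The mismatch described in the EDIT can be reproduced with NumPy alone. The following is a minimal sketch, not the sklearn source: it assumes only that the fitted mean has one entry per column of the matrix passed to fit, and it uses 50 features as a stand-in for 5000. The helper can_center is a hypothetical name introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(733, 50))  # samples x features (50 stands in for 5000)
test = rng.normal(size=(360, 50))

# Fitting on the transpose treats the 733 samples as columns,
# so the per-column mean has one entry per training sample.
mean_from_transposed_fit = train.T.mean(axis=0)  # shape (733,)

# Fitting on the matrix as-is gives one entry per feature.
mean_from_plain_fit = train.mean(axis=0)  # shape (50,)


def can_center(X, mean):
    """Return True if X - mean broadcasts, False if NumPy raises ValueError."""
    try:
        X - mean
        return True
    except ValueError:
        return False


print(can_center(test, mean_from_transposed_fit))  # shapes (360, 50) and (733,)
print(can_center(test, mean_from_plain_fit))       # shapes (360, 50) and (50,)
```

Only the second subtraction broadcasts, because the test matrix shares its feature axis with the training matrix, not its sample axis.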

konstantin

1 Answer

You shouldn't transpose your matrix in the fit call; or, if you have to, you must also transpose the matrix in the transform call:

pca.fit(train_dataset_mod1)
pca2.fit(train_dataset_mod2)
mod1_features_test = pca.transform(test_dataset_mod1)
mod2_features_test = pca2.transform(test_dataset_mod2)

or:

pca.fit(train_dataset_mod1.transpose())
pca2.fit(train_dataset_mod2.transpose())
mod1_features_test = pca.transform(test_dataset_mod1.transpose())
mod2_features_test = pca2.transform(test_dataset_mod2.transpose())

Javad Sameri
  • I tried both of them and received an error in both cases. If I do not transpose the training data, then PCA is performed on the samples rather than on the features, which is not useful. For the second solution, whether or not I transpose the test data, I receive the same message: ValueError: operands could not be broadcast together with shapes (5000,360) (733,) – konstantin Jun 02 '17 at 08:24
  • The first approach works, but the dimensionality reduction takes place over the samples instead of over the features of the matrix. – konstantin Jun 02 '17 at 09:13
  • @konstantin If I understand your question right: .fit only finds the transformation matrix without transforming the data. You can use .transform, or .fit_transform in the first place, to transform your data. Good luck bro – Javad Sameri Jun 02 '17 at 15:39
  • Dude, it was my mistake: I thought that pca2.components_ was the transformed training data, which isn't the case. – konstantin Jun 02 '17 at 16:06
  • @konstantin I thought you wanted to use it later :D – Javad Sameri Jun 02 '17 at 19:59
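
The resolution reached in the comments can be sketched as follows. scikit-learn's PCA expects input of shape (n_samples, n_features) and reduces the feature axis, so no transpose is needed; components_ holds the principal axes with shape (n_components, n_features), while the reduced data comes from fit_transform or transform. This is one possible rewrite of the question's function under those assumptions, not the original author's final code; the n_components parameter is added here for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA


def dimensionality_reduction(train_dataset_mod1, train_dataset_mod2,
                             test_dataset_mod1, test_dataset_mod2,
                             n_components=200):
    """Fit one PCA per modality on the training data, then project train and test."""
    # Rows are samples and columns are features, so the data is passed as-is.
    pca = PCA(n_components=n_components)
    mod1_features_train = pca.fit_transform(train_dataset_mod1)
    # Reuse the mean and axes learned from the training set for the test set.
    mod1_features_test = pca.transform(test_dataset_mod1)

    pca2 = PCA(n_components=n_components)
    mod2_features_train = pca2.fit_transform(train_dataset_mod2)
    mod2_features_test = pca2.transform(test_dataset_mod2)

    return (mod1_features_train, mod2_features_train,
            mod1_features_test, mod2_features_test)
```

With the sizes from the question (733x5000 and 360x5000 for mod1, 733x8000 and 360x8000 for mod2) and n_components=200, the training outputs should come out as 733x200 and the test outputs as 360x200.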