This is a follow-up question to:

PCA Dimensionality Reduction

In order to classify the new 10-dimensional test data, do I have to reduce the training data down to 10 dimensions as well?

I tried:

X = bsxfun(@minus, trainingData, mean(trainingData,1));           
covariancex = (X'*X)./(size(X,1)-1);                 
[V D] = eigs(covariancex, 10);   % reduce to 10 dimension
Xtrain = bsxfun(@minus, trainingData, mean(trainingData,1));  
pcatrain = Xtest*V;

But using the classifier with this and the 10-dimensional test data produces very unreliable results. Is there something that I am doing fundamentally wrong?

Edit:

X = bsxfun(@minus, trainingData, mean(trainingData,1));           
covariancex = (X'*X)./(size(X,1)-1);                 
[V D] = eigs(covariancex, 10);   % reduce to 10 dimension
Xtrain = bsxfun(@minus, trainingData, mean(trainingData,1));  
pcatrain = Xtest*V;

X = bsxfun(@minus, pcatrain, mean(pcatrain,1));           
covariancex = (X'*X)./(size(X,1)-1);                 
[V D] = eigs(covariancex, 10);   % reduce to 10 dimension
Xtest = bsxfun(@minus, test, mean(pcatrain,1));  
pcatest = Xtest*V;
user3094936

1 Answer

You have to reduce both the training and the test data, but in the same way. Once you have obtained the reduction matrix from PCA on the training data, you must use that same matrix to reduce the dimensionality of the test data. In short, you need a single, constant transformation that is applied to both the training and the testing samples.

Using your code

% first, zero-mean the data (use the training-set mean for both sets)
mu     = mean(Xtrain, 1);
Xtrain = bsxfun(@minus, Xtrain, mu);
Xtest  = bsxfun(@minus, Xtest,  mu);

% Compute PCA on the training data only
covariancex = (Xtrain'*Xtrain)./(size(Xtrain,1)-1);
[V, D] = eigs(covariancex, 10);   % reduce to 10 dimensions

pcatrain = Xtrain*V;
% here you should train your classifier on pcatrain and ytrain (correct labels)

pcatest = Xtest*V;
% here you can test your classifier on pcatest using ytest (compare with correct labels)
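As a minimal sketch of that training/testing step, assuming numeric label vectors `ytrain` and `ytest` are available and that the Statistics and Machine Learning Toolbox is installed (the k-NN model is only an illustrative choice, not necessarily the classifier from the original question):

% train a classifier on the reduced training data, e.g. a k-NN model
model = fitcknn(pcatrain, ytrain, 'NumNeighbors', 5);

% classify the reduced test data and compare with the known labels
predictedLabels = predict(model, pcatest);
errorRate = mean(predictedLabels ~= ytest);   % assumes numeric class labels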
lejlot
    no, it is completely wrong. You are supposed to run PCA just **once**. Take the matrix `V` and use it for scaling all the data, both training and testing. – lejlot Dec 14 '13 at 15:18
  • In pseudocode it should look like this: you have (I assume 0-mean data) `Xtrain, Ytrain, Xtest, Ytest`: `V = PCA(Xtrain); Xtrain_pca = Xtrain * V; model.fit( Xtrain_pca, Ytrain ); Xtest_pca = Xtest * V; model.test(Xtest_pca);` – lejlot Dec 14 '13 at 15:27
  • So I need to completely change it? What are Ytrain and Ytest? – user3094936 Dec 14 '13 at 15:39
  • I don't know if one can call changing 10 lines "a complete change", but yes. Ytrain and Ytest are my names for label vectors. But they do not really matter, this is just a pseudocode showing you the procedure – lejlot Dec 14 '13 at 16:06
  • The only thing with that is there is no Ytest i.e. test labels? – user3094936 Dec 15 '13 at 15:34
  • Also the classifier classifies the samples stored in testData and returns the corresponding estimated testLabels. It takes 3 arguments: The matrix testData holds the feature vectors that are to be classified where each sample is stored as a separate row. The matrix trainData is the training data feature vectors with each sample being stored as a matrix row. trainLabels is a vector of integer class labels corresponding to the data in trainData. – user3094936 Dec 15 '13 at 15:39
  • You seem to have problems with basic understanding of the classification process. I suggest you to ask a separate question about the basics, as your comments show serious lacks of understanding of what is going on. – lejlot Dec 15 '13 at 17:03
  • Not really. Gather a load of samples and attach known labels to them i.e. training data and training labels; take samples you want to classify and compare them with said training data and training labels to produce testing labels for them. – user3094936 Dec 15 '13 at 17:48
  • @lejlot Is this why you subtract the mean of `Xtrain` from `Xtest`, to ensure it is the same transformation? – Ray Dec 20 '13 at 16:25
  • @lejlot thanks a lot! For the covariance calculation, a lot of memory is required when using high-dimensional training data. Is there a way to use e.g. the `pca` function to achieve the same result, but more efficiently in terms of memory usage? – bonanza Aug 10 '16 at 07:44
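Regarding the memory question in the last comment, a possible sketch, assuming MATLAB's `pca` from the Statistics and Machine Learning Toolbox: as far as I know, its default SVD-based algorithm works on the centered data matrix directly, so the full covariance matrix is never formed. Here `Xtrain` and `Xtest` are the raw (uncentered) data matrices.

% pca centers the data internally and returns the training mean as mu
[coeff, pcatrain, ~, ~, ~, mu] = pca(Xtrain, 'NumComponents', 10);

% apply the same transformation (same mean, same loadings) to the test data
pcatest = bsxfun(@minus, Xtest, mu) * coeff;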