5

By adding PCA to the pipeline, I'm trying to improve on the 96.5% prediction score I get from scikit-learn's kNN on the Kaggle digit recognition tutorial, yet the new kNN predictions based on the PCA output are terrible, around 23%.

Below is the full code; I'd appreciate it if you could point out where I went wrong.

import pandas as pd
import numpy as np
import pylab as pl
import os
from sklearn import metrics
%pylab inline
os.chdir("/users/******/desktop/python")

traindata=pd.read_csv("train.csv")
traindata=np.array(traindata)
traindata=traindata.astype(float)
X,y=traindata[:,1:],traindata[:,0]

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.25, random_state=33)

#scale & PCA train data
from sklearn import preprocessing
from sklearn.decomposition import PCA
X_train_scaled = preprocessing.scale(X_train)
estimator = PCA(n_components=350)
X_train_pca = estimator.fit_transform(X_train_scaled)

# sum(estimator.explained_variance_ratio_) = 0.96

from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=6)
neigh.fit(X_train_pca,y_train)

# scale & PCA test data
X_test_scaled=preprocessing.scale(X_test)
X_test_pca=estimator.fit_transform(X_test_scaled)

y_test_pred=neigh.predict(X_test_pca)
# print metrics.accuracy_score(y_test, y_test_pred) = 0.23
# print metrics.classification_report(y_test, y_test_pred)
kannbaba

2 Answers

19

When processing the test data, you used fit_transform(X_test), which recomputes a new PCA transformation from the test data itself. You should use transform(X_test) instead, so that the test data undergoes the same transformation that was fitted on the training data.

The relevant portion of the code will look something like this (thanks ogrisel for the whiten tip):

estimator = PCA(n_components=350, whiten=True)
X_train_pca = estimator.fit_transform(X_train)
X_test_pca = estimator.transform(X_test)

Try and see if it helps?
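For context, here is a self-contained sketch of this fit-once pattern. It uses scikit-learn's bundled 8×8 digits dataset in place of the Kaggle CSV, `n_components=40` instead of 350, and the modern `sklearn.model_selection` import path; all three are substitutions for illustration, not from the original post:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Small built-in digit images (8x8), standing in for the Kaggle train.csv
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

# Fit PCA on the training split only; whiten=True rescales the components
# so a separate preprocessing.scale step is unnecessary.
estimator = PCA(n_components=40, whiten=True)
X_train_pca = estimator.fit_transform(X_train)
X_test_pca = estimator.transform(X_test)   # transform only, no refit

neigh = KNeighborsClassifier(n_neighbors=6)
neigh.fit(X_train_pca, y_train)
score = neigh.score(X_test_pca, y_test)
```

With the test set projected through the same fitted transformation, the accuracy stays in the high-90s range rather than collapsing as in the question.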

YS-L
    Also there is no need to scale the PCA transformed data. It's possible to pass the `whiten=True` param to the PCA constructor to get the same result. – ogrisel Jan 24 '14 at 11:47
  • Gentlemen, following both replies, the code was altered as follows: `estimator = PCA(n_components=350,whiten=True) estimator.fit(X_train) X_train_pca=estimator.transform(X_train) .... estimator.fit(X_test) X_test_pca=estimator.transform(X_test)`, yet the prediction score dropped to 11%. Any suggestions? (Sorry for the messy comment; I can't add line breaks in comments.) – kannbaba Jan 25 '14 at 07:34
  • 3
    Since you call ``estimator.fit(X_test)`` again, your new code still transforms the training and testing data differently, i.e. it is more or less the same as the original version. There should be only a single call to ``fit`` on the PCA transformer. – YS-L Jan 25 '14 at 07:54
0

You have to:

  1. fit and transform (using `.fit_transform`) on the training set,
  2. and only transform (using `.transform`) on your test set.
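The same rule applies to any fitted preprocessor, not just PCA: learn the statistics from the training set once, then reuse them on the test set. A minimal sketch with `StandardScaler` on synthetic data (the array shapes and seed are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_s = scaler.transform(X_test)        # reuse that same mean/std
```

The training output is exactly zero-mean, while the test output is only approximately so; that small mismatch is expected and is precisely what calling `fit` a second time on the test set would wrongly erase.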
Stephen Rauch
  • 47,830
  • 31
  • 106
  • 135
Thierry K.
  • 97
  • 1
  • 6