Split X into test/train before pre-processing and dimension reduction or after? Machine Learning

Question

I have been completing Microsoft's course DAT210X - Programming with Python for Data Science.

When creating SVC models for machine learning, we are encouraged to split the dataset X into test and train sets, using train_test_split from scikit-learn, before performing preprocessing (e.g. scaling) and dimension reduction (e.g. PCA/Isomap). Below I include part of a solution I wrote to a given problem using this approach.

However, it appears to be much faster to preprocess and apply PCA/Isomap to all of X before splitting it into test and train sets, and doing so gave a higher accuracy score.

My questions are:

1) Is there a reason why we can't slice out the label (y) and perform pre-processing and dimension reduction on all of X before splitting out to test and train?

2) There was a higher score with pre-processing and dimension reduction on all of X (minus y) than for splitting X and then performing pre-processing and dimension reduction. Why might this be?

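import numpy as np
from sklearn import preprocessing
from sklearn.manifold import Isomap
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# X (features) and y (labels) are assumed to be loaded already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=7)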

step_c = .05
endpt_c = 2 + step_c
startpt_c = .05

step_g = .001
endpt_g = .1 + step_g
startpt_g = .001

bestscore = 0.0
best_c = 0.0
best_g = 0.0

pre_proc = [
        preprocessing.Normalizer(),
        preprocessing.MaxAbsScaler(),
        preprocessing.MinMaxScaler(),
        preprocessing.KernelCenterer(), 
        preprocessing.StandardScaler()
       ]
best_proc = None
best_n_neighbors = None
best_n_components = None

print('running......')

# pre-processing (scaling etc)
for T in pre_proc: 
    X_train_T = T.fit_transform(X_train) 
    X_test_T =  T.transform(X_test) # only apply transform to X_test!

    # dimensionality reduction
    for k in range(2, 6):
        for l in range(4, 7):
            iso = Isomap(n_neighbors = k, n_components = l)
            X_train_iso = iso.fit_transform(X_train_T)
            X_test_iso = iso.transform(X_test_T)

            # SVC parameter sweeping
            for i in np.arange(startpt_c,endpt_c, step_c):
                # print(i)
                for j in np.arange(startpt_g,endpt_g, step_g):

                    clf = SVC(C=i, gamma=j, kernel='rbf')
                    clf.fit(X_train_iso, y_train) 
                    score = clf.score(X_test_iso, y_test)

                    if bestscore < score:
                        bestscore = score
                        best_c = i
                        best_g = j
                        best_proc = T
                        best_n_neighbors = k
                        best_n_components = l

# Print the parameters that gave the best score (not the last loop values):
print('proc: ' + str(best_proc), 'score: ' + str(bestscore), 'C: ' + str(best_c), 'g: ' + str(best_g), 'n_neigh: ' + str(best_n_neighbors), 'n_comp: ' + str(best_n_components))

score 9 · Accepted Answer · answered Aug 11 '17 at 16:52

Regarding

1) Is there a reason why we can't slice out the label (y) and perform pre-processing and dimension reduction on all of X before splitting out to test and train?

The reason is that you should train your model on the training data, without using any information from the test data. If you apply PCA to the whole dataset (including the test data) before training the model, then you are in fact using some information from the test data. Thus, you cannot really judge the behaviour of your model using the test data, because it is no longer unseen data.
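As a rough illustration of the point above, here is a minimal sketch that chains the steps in a scikit-learn Pipeline so that scaling and PCA are fitted on the training rows only. The synthetic data from make_classification, the step names and the parameter values (n_components=4, C=1.0, gamma=0.1) are assumptions for the example, not values from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; replace with your own X and y).
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=7)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7)

# Each step is fitted only on the data passed to fit().
model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=4)),
    ("svc", SVC(kernel="rbf", C=1.0, gamma=0.1)),
])

# fit() sees the training rows only; score() merely transforms the test rows.
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)

# The scaler's learned means match the training-set column means.
train_only = np.allclose(model.named_steps["scale"].mean_,
                         X_train.mean(axis=0))
print("test accuracy:", test_score)
print("scaler fitted on training rows only:", train_only)
```

Wrapping the steps in a single object also makes it harder to call fit_transform on the test set by accident.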

Regarding:

2) There was a higher score with pre-processing and dimension reduction on all of X (minus y) than for splitting X and then performing pre-processing and dimension reduction. Why might this be?

This makes complete sense. You used some information from the test data to train the model, so it makes sense that the score on the test data would be higher. However, this score does not really give an estimate of the model's behaviour on unseen data anymore.
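To see the effect on your own data, the two orderings can be scored side by side. The sketch below is only an illustration under assumed settings: synthetic data, PCA with 4 components, and hypothetical helper names score_split_first and score_transform_first. The size of the gap depends on the dataset and the split, and it can be small or even reversed on a particular run.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; replace with your own X and y).
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=7)


def score_split_first(X, y):
    """Split, then fit the scaler and PCA on the training rows only."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, random_state=7)
    scaler = StandardScaler().fit(X_tr)
    pca = PCA(n_components=4).fit(scaler.transform(X_tr))
    clf = SVC(kernel="rbf", C=1.0, gamma=0.1)
    clf.fit(pca.transform(scaler.transform(X_tr)), y_tr)
    return clf.score(pca.transform(scaler.transform(X_te)), y_te)


def score_transform_first(X, y):
    """Fit the scaler and PCA on all rows, then split (test rows leak in)."""
    X_all = PCA(n_components=4).fit_transform(
        StandardScaler().fit_transform(X))
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_all, y, test_size=0.30, random_state=7)
    clf = SVC(kernel="rbf", C=1.0, gamma=0.1).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)


clean_score = score_split_first(X, y)
leaky_score = score_transform_first(X, y)
print("split first:", clean_score)
print("transform first:", leaky_score)
```

Only the first number is an estimate of performance on unseen data; the second was computed with a transformation that had already seen the test rows.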
