I'm currently using sklearn for a school project and I have some questions about how GridSearchCV applies preprocessing algorithms such as PCA or Factor Analysis. Let's suppose I perform a hold-out split:

X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = 0.1, stratify = y)

Then I declare some hyperparameters and run a GridSearchCV (it would be the same with RandomizedSearchCV):

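from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# make_pipeline names each step after its lowercased class, so the SVC step is 'svc'
params = {
    'svc__C' : [...],
    'svc__tol' : [...],
    'svc__degree' : [...]  # note: degree is ignored when kernel='linear'
}
clf = make_pipeline(PCA(), SVC(kernel='linear'))
model = GridSearchCV(clf, params, cv = 5, verbose = 2, n_jobs = -1)
model.fit(X_tr, y_tr)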

My issue is this: my teacher told me that, in k-fold CV, you should never fit the preprocessing algorithm (here PCA) on the validation split, only on the training split (here both splits are subsets of X_tr, and of course they change at every fold). So PCA() should be fitted on the part of the fold used for training the model; then, when I test the resulting model against the validation split, that split should be transformed with the PCA already fitted on the training part. This ensures no leakage whatsoever.
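To make concrete what I mean, here is a minimal sketch of the per-fold procedure written out by hand. The synthetic data from make_classification, n_components=5 and C=1.0 are placeholders I picked for illustration, not my real setup, and I am not claiming this is what GridSearchCV does internally; it is only the behaviour my teacher describes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for X_tr, y_tr (assumption, not my real data)
X_tr, y_tr = make_classification(n_samples=200, n_features=10, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for fit_idx, val_idx in skf.split(X_tr, y_tr):
    # Fit PCA on the training part of this fold only
    pca = PCA(n_components=5)
    X_fit = pca.fit_transform(X_tr[fit_idx])
    # The validation part is only transformed, never used for fitting
    X_val = pca.transform(X_tr[val_idx])

    svc = SVC(kernel='linear', C=1.0)
    svc.fit(X_fit, y_tr[fit_idx])
    scores.append(svc.score(X_val, y_tr[val_idx]))

print(np.mean(scores))  # mean validation accuracy over the 5 folds
```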

Does sklearn account for this?

And if it does: suppose that now I want to use imblearn to perform oversampling on an unbalanced set:

from imblearn.pipeline import make_pipeline  # sklearn's make_pipeline rejects samplers like SMOTE
clf = make_pipeline(SMOTE(), SVC(kernel='linear'))

Still according to my teacher, you shouldn't oversample the validation split either, as this could lead to misleading accuracy estimates. So what held for PCA above (transforming the validation split afterwards with the fitted step) does not apply here: the validation split should not be resampled at all. Does sklearn/imblearn account for this as well?
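Again, a minimal hand-written sketch of the behaviour I am after. To keep it runnable with sklearn alone, I swapped SMOTE for plain random oversampling via sklearn.utils.resample, so this is a stand-in and not SMOTE itself; the synthetic imbalanced data and C=1.0 are also just placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.utils import resample

# Synthetic imbalanced stand-in for X_tr, y_tr (assumption, not my real data)
X_tr, y_tr = make_classification(n_samples=300, n_features=10,
                                 weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores, val_sizes = [], []
for fit_idx, val_idx in skf.split(X_tr, y_tr):
    X_fit, y_fit = X_tr[fit_idx], y_tr[fit_idx]

    # Oversample the minority class in the training part only
    # (random oversampling used here as a stand-in for SMOTE)
    counts = np.bincount(y_fit)
    minority = y_fit == counts.argmin()
    n_extra = int(counts.max() - counts.min())
    X_extra, y_extra = resample(X_fit[minority], y_fit[minority],
                                replace=True, n_samples=n_extra,
                                random_state=0)
    X_bal = np.vstack([X_fit, X_extra])
    y_bal = np.concatenate([y_fit, y_extra])

    svc = SVC(kernel='linear', C=1.0).fit(X_bal, y_bal)

    # The validation part is scored as-is, with its original class balance
    scores.append(svc.score(X_tr[val_idx], y_tr[val_idx]))
    val_sizes.append(len(val_idx))

print(np.mean(scores))
```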

Many thanks in advance

Asduffo
  • Does this help? https://stackoverflow.com/questions/61453795/using-sklearns-randomizedsearchcv-with-smote-oversampling-only-on-training-fold – desertnaut Apr 28 '20 at 14:32
  • If you read the article I linked in my question that @desertnaut has linked above, you will see that SMOTE is not applied to the validation set. However, there are of course cases where you need the preprocessing steps of the pipeline applied to all of your data (StandardScaler, for example), and I don't know how you can tell the pipeline to distinguish between these two cases. – KOB Apr 28 '20 at 15:20
  • Perfect. Thanks to both of you! – Asduffo Apr 30 '20 at 14:04

0 Answers