
I'm using GridSearchCV to tune hyperparameters, and now I want to apply standardization (StandardScaler()) in the training and validation steps. But I think I cannot do this.

My questions are:

  1. If I apply the preprocessing step to the whole training set and then send it to GridSearchCV to do 10-fold CV, this is going to lead to data leakage, right? The training set is split into 10 folds, meaning 9 folds for training and 1 fold for validation, and the normalization should be fitted only on the training folds, not the validation fold, right?
  2. If I use sklearn's Pipeline, it won't solve this problem, right? Because it runs only once and leads to data leakage again.
  3. Is there another way to do this while still using GridSearchCV to tune the parameters?
Venkatachalam

1 Answer


Indeed this will cause a data leak; it's very good that you caught it!

A solution using a pipeline is to make a pipeline with StandardScaler as the first step, followed by your classifier of choice, and then pass this pipeline to GridSearchCV:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

clf = make_pipeline(StandardScaler(),
                    MyClassifier())
# param_grid is required; its keys are prefixed with the step name, e.g. 'myclassifier__C'
grid_search = GridSearchCV(clf, param_grid, refit=True)

For more info, check this article here
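Not part of the original answer, but a runnable end-to-end sketch of the approach above, using LogisticRegression as a stand-in for the classifier and an assumed toy parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# toy data as a stand-in for the real training set
X, y = make_classification(n_samples=100, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression())
# pipeline parameters are addressed as '<step name>__<param>'
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}

grid_search = GridSearchCV(clf, param_grid, cv=10, refit=True)
grid_search.fit(X, y)  # the scaler is re-fitted inside every CV split
print(grid_search.best_params_)
```

GridSearchCV re-fits the whole pipeline, scaler included, on the training portion of every split, so the held-out fold never influences the scaling statistics.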

Ahmed Ragab
  • Thanks for answering my question. Does the StandardScaler() run every time the fold changes, or does it run only once (on the entire dataset)? Is there any way to do this strategy and prevent the data leak? – Puntawat Ponglertnapakorn Apr 15 '19 at 15:57
  • Yes, given my code above, it will re-fit the scaler on every fold, and if you have refit=True, it will eventually fit the scaler on all the data alongside the model with the best hyperparameters. This solves the data-leak problem. – Ahmed Ragab Apr 15 '19 at 16:02
  • Sounds great. But how do you know it will rerun on every fold? I mean the StandardScaler() should run twice: 1. on the training set and 2. on the validation set, and these two steps should run 10 times for a 10-fold strategy. Am I right? – Puntawat Ponglertnapakorn Apr 15 '19 at 16:07
  • What will happen is that for each fold the StandardScaler will be fitted on the training portion, and then used to transform the testing portion using the statistics calculated on the training data. There is no separate validation here; each fold only has a training and a testing split. – Ahmed Ragab Apr 15 '19 at 16:10
  • I have added an article to my answer that should help. – Ahmed Ragab Apr 15 '19 at 16:11
  • You are right. I just split the unseen test set out from my data and use the rest as the training set for the 10 folds, so I call them the "training fold" and the "validation fold". Anyway, the StandardScaler() will re-run every time, as you said, fitted on the training folds and used to transform the validation fold (a.k.a. test fold). If I'm right, this will prevent the data-leak problem. Thank you very much – Puntawat Ponglertnapakorn Apr 15 '19 at 16:22
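To illustrate the per-fold behaviour discussed in the comments, here is a minimal hand-written sketch (not from the original thread) of what GridSearchCV effectively does inside each split, using KFold directly on toy data:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(10, 2)  # toy data, 10 samples

for train_idx, test_idx in KFold(n_splits=5).split(X):
    # the scaler is fitted on the training portion only ...
    scaler = StandardScaler().fit(X[train_idx])
    # ... and its statistics are reused to transform the held-out portion
    X_test_scaled = scaler.transform(X[test_idx])
    # the mean comes from the training rows, never from the held-out rows
    assert np.allclose(scaler.mean_, X[train_idx].mean(axis=0))
```

A fresh scaler is fitted in every iteration, which is exactly why wrapping it in the pipeline prevents the leak.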