So I've gotten myself a little confused.

At the moment, I've got a dataset of about 800 instances. There were missing values, so I split it into a training set and a test set, then used SimpleImputer from sklearn: I called fit_transform on the training set and transform on the test set. I did that because if I want to predict for new instances that have missing values, I'll need to impute them the same way I imputed the test set, using the imputer fitted on the training data.
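For reference, the setup described above looks roughly like the sketch below. The synthetic data, the 80/20 split, and the mean strategy are assumptions made for illustration; the post does not state which strategy or split ratio was used.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 800 rows, 3 features,
# with roughly 10% of the entries missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=800)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)  # learn column means from training rows only
X_test_imp = imputer.transform(X_test)        # reuse those same means on the held-out rows

# The fill values are the training-set column means; the test rows never influence them.
print(imputer.statistics_)
```

New instances at prediction time would go through the same `imputer.transform` call.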

Now I want to use cross-validation to train and score models, but that would involve taking the whole dataset and splitting it into different training and testing sets. So I'm worried about leakage, because the imputed values were fitted on my original training set and would end up in the new splits?

bunbun
Alexia M

1 Answer


Generally, you'll want to split your data into three sets: a training set, a validation set, and a testing set. The testing set should be left out of training completely (your concern about leakage is correct). When using cross-validation, you don't need to split the training and validation sets yourself; that's what cross-validation does for you. Simply pass the training set to the cross-validator, let it split the data into training and validation folds behind the scenes, and test the final model on your testing set, which has been kept out of the training process entirely.
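A minimal sketch of that workflow, assuming scikit-learn, synthetic data from `make_classification`, and a logistic regression as a placeholder model (none of these come from the question):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Placeholder data (assumption): 800 instances, no missing values here.
X, y = make_classification(n_samples=800, n_features=10, random_state=0)

# Hold out a test set that cross-validation never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)

# Cross-validation splits the training set into train/validation folds internally.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy per fold:", cv_scores)

# Refit on the full training set, then evaluate once on the untouched test set.
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print("Held-out test accuracy:", test_score)
```

Note that this sketch has no preprocessing step; any step that learns from the data, such as an imputer, would need to be fitted inside each fold rather than before the call to `cross_val_score`.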

Brandon Schabell
  • But if I do the imputation before running the CV, then information from the different validation sets will automatically flow into the training sets. I think I would need to redo the imputation for each fold. So if I have a 5-fold CV, I will have 5 training and validation sets. I will need to compute the imputation values on each of the training sets and apply them to the corresponding validation sets. Is that right? – Simon Hessner Aug 27 '20 at 17:27
  • Ideally yes, you'll want to impute at each different fold. Scikit-learn allows you to do this by using pipelines, so you can stack all your preprocessors, imputers and models into your CV. – user1903753 Dec 01 '20 at 11:47
  • How can I do this in R? @user1903753 and how large should my dataset be if it's going to have so many subsets of the original? – Antonio Sep 20 '22 at 01:45
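
The per-fold imputation discussed in the comments above can be sketched with a scikit-learn `Pipeline`. The synthetic data, the mean strategy, and the logistic regression are assumptions chosen for illustration. Because the imputer is a pipeline step, each fold refits it on that fold's training rows only, which is visible in the slightly different fill values per fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Placeholder data (assumption): 800 instances with ~10% of entries set to NaN.
X, y = make_classification(n_samples=800, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer is part of the estimator, so it is fitted inside each CV fold.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LogisticRegression(max_iter=1000)),
])

results = cross_validate(pipe, X, y, cv=5, return_estimator=True)
print("Accuracy per fold:", results["test_score"])

# One row of learned column means per fold; the rows differ because each
# fold's imputer only saw that fold's training portion.
fold_means = np.array(
    [est.named_steps["impute"].statistics_ for est in results["estimator"]]
)
print(fold_means)
```

The same `pipe` object can then be fitted on the full training set and used to predict new instances, so the imputation and the model always travel together.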