I've just been reading up on k-fold cross-validation and have realized that I'm inadvertently leaking data with my current preprocessing setup.
Usually, I have a train and a test dataset. I do a bunch of data imputation and one-hot encoding on my entire train dataset and then run k-fold cross-validation.
The leakage comes in because, with 5-fold cross-validation, each fold trains on 80% of my train data and validates on the remaining 20%, yet the imputation statistics were computed from 100% of the train data, including that held-out 20%. I really should be fitting the imputation on the 80% and only applying it to the 20% (whereas I was using 100% of the data before).
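If I've got this right, the fix is to fit the imputer inside the cross-validation loop rather than up front. Here's a minimal sketch of what I mean, using sklearn's `KFold` and `SimpleImputer` (the `X`/`y` arrays and the commented-out model step are just made-up placeholders):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold

# Made-up toy data with some missing values.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan],
              [5.0, 6.0], [7.0, 8.0]])
y = np.array([0, 1, 0, 1, 0])

for train_idx, val_idx in KFold(n_splits=5).split(X):
    imputer = SimpleImputer(strategy="mean")
    # Fit the imputation statistics on the 80% training fold only...
    X_train = imputer.fit_transform(X[train_idx])
    # ...then apply those same statistics to the held-out 20%.
    X_val = imputer.transform(X[val_idx])
    # model.fit(X_train, y[train_idx]) / model.score(X_val, y[val_idx]) here
```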
1) Is this the right way to think about cross-validation?
2) I've been looking at the `Pipeline` class in `sklearn.pipeline`, and it seems useful for doing a bunch of transformations and then finally fitting a model to the resulting data. However, I'm doing a bunch of stuff like "impute missing data in `float64` columns with the mean", "impute all other data with the mode", etc. There isn't an obvious transformer for this kind of imputation. How would I go about adding this step to a `Pipeline`? Would I just make my own subclass of `BaseEstimator`?
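Here's a rough sketch of what I'm imagining for this, using `ColumnTransformer` from `sklearn.compose` with `SimpleImputer`: mean-impute the `float64` columns, mode-impute (and one-hot encode) everything else, and wrap the whole thing in a `Pipeline` so cross-validation refits the imputers per fold. The DataFrame, column names, and final estimator are placeholders I made up:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data: two float64 columns and one categorical column.
df = pd.DataFrame({
    "age":    [25.0, np.nan, 40.0, 31.0, 58.0, np.nan, 47.0, 33.0],
    "income": [50.0, 60.0, np.nan, 75.0, 80.0, 55.0, np.nan, 65.0],
    "city":   ["NY", "SF", np.nan, "NY", "SF", "NY", "SF", np.nan],
})
y = np.array([0, 1, 0, 1, 1, 0, 1, 0])

float_cols = df.select_dtypes(include="float64").columns.tolist()
other_cols = [c for c in df.columns if c not in float_cols]

preprocess = ColumnTransformer([
    # "impute missing data in float64 columns with the mean"
    ("num", SimpleImputer(strategy="mean"), float_cols),
    # "impute all other data with the mode", then one-hot encode it
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), other_cols),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])

# cross_val_score clones the pipeline for each fold, so the imputers
# are fit on each training fold only -- no leakage into the held-out fold.
print(cross_val_score(pipe, df, y, cv=4))
```

My understanding is that if the per-column logic ever gets more complicated than `ColumnTransformer` can express, the fallback is a custom transformer subclassing `BaseEstimator` and `TransformerMixin` with `fit`/`transform` methods, which then drops into the `Pipeline` like any built-in step. Does that sound right?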
Any guidance here would be great!