So I've gotten myself a little confused.
At the moment, I've got a dataset of about 800 instances, which I've split into a training set and a validation set. There were missing values, so I used SimpleImputer from sklearn: I called fit_transform on the training set and transform on the validation set. I did it that way because if I want to predict for new instances that have missing values, I'll need to impute them with the same imputer that was fitted on the training set.
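To make that concrete, here is a minimal sketch of roughly what I did. The data is a synthetic stand-in, and the mean strategy, the 80/20 split, and the random seeds are assumptions for illustration rather than my real settings.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for my data (assumption): 800 rows, 5 numeric features,
# with roughly 10% of the entries set to NaN.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 5))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=800)

# Hold out 20% as the validation set (split ratio is an assumption).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)  # column means learned from training rows only
X_val_imp = imputer.transform(X_val)          # the same means reused on validation rows
```

The fitted `imputer` is what I would keep around to call `transform` on new instances later.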
Now I want to use cross-validation to train and score models, but that would mean using the whole dataset and splitting it into different training and testing folds. The imputer was fitted on my original training set, and some of those rows would end up in the testing folds, so I'm worried about leakage through the imputed values. Is that a real problem?
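One setup that would seem to avoid this is to put the imputer inside a Pipeline and pass the whole pipeline to cross_val_score, so that the imputer is refit on the training portion of each fold. This is only a sketch of what I have in mind: the synthetic data, the mean strategy, the 5 folds, and LogisticRegression as a placeholder model are all assumptions, not part of my actual project.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for my data (assumption): 800 rows, 5 numeric features,
# with roughly 10% of the entries set to NaN.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 5))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=800)

# The imputer is a step of the estimator, so cross_val_score clones and refits
# it on each fold's training rows; the held-out rows are only transformed.
model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)  # one accuracy score per fold

print(scores.mean())
```

Note that the un-imputed X goes in here, not the arrays I had already imputed.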