I am learning how to prepare data, build estimators, and evaluate them using a train/test split.
My question is how I can prepare the test dataset correctly.
I split my data into a training set and a test set. Following the book "Hands-On Machine Learning with Scikit-Learn", I set up a pipeline for my data preparation:
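from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fill missing values with the column median, then standardize each feature.
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])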
After training my estimator, I want to run it on the test data to measure its accuracy. However, if I pass my test features through the pipeline I defined, won't it compute a new median from the test set alone, and fit std_scaler on the test set as well, giving values different from those computed on the training set?
I presume that, for consistency, I should reuse the values learned during training, since those are what the estimator was fitted on. For example, if the test set were just a single row (or, in production, a single input I want a prediction for), a median could not even be computed for a feature that is NaN in that row!
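To make the concern concrete, here is a minimal sketch of what I think happens if the same pipeline is fitted again on the test rows. The toy arrays and the helper name build_num_pipeline are made up for illustration; only the two pipeline steps come from my actual setup.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_num_pipeline():
    # Hypothetical helper: returns a fresh, unfitted copy of the pipeline above.
    return Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
    ])


# Made-up toy data, purely for illustration.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[10.0, 100.0], [np.nan, 200.0], [30.0, 300.0]])

# Medians learned from the training rows.
train_pipeline = build_num_pipeline()
train_pipeline.fit(X_train)
train_medians = train_pipeline.named_steps['imputer'].statistics_

# Fitting a second copy on the test rows learns its own medians.
refit_pipeline = build_num_pipeline()
refit_pipeline.fit(X_test)
refit_medians = refit_pipeline.named_steps['imputer'].statistics_

print("medians from training data:", train_medians)
print("medians from test data:    ", refit_medians)
```

On this toy data the two sets of medians differ, which is exactly the inconsistency I am worried about.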
What step am I missing?