If you want to use an imputer to fill missing values in your training data with the median, you first need to calculate what that median is. This is what happens when you call fit().
Now you have the median value, but you haven't altered your dataset yet. To do that you need to change (or transform) it, which is what happens when you call transform(). Often you want to calculate a median and immediately use it to replace NaNs or some other missing-value marker; fit_transform() does both of these steps in one go for convenience.
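As a minimal sketch of the two routes, assuming a recent scikit-learn where the median imputer is sklearn.impute.SimpleImputer (the toy array and variable names are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training column with one missing value (illustrative data only).
X_train = np.array([[1.0], [2.0], [np.nan], [9.0]])

# Route 1: two explicit steps.
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                      # computes the median; X_train is untouched
X_two_steps = imputer.transform(X_train)  # returns a copy with the NaN filled in

# Route 2: the same thing in one call.
X_one_step = SimpleImputer(strategy="median").fit_transform(X_train)

print(X_two_steps.ravel())
print(X_one_step.ravel())
```

Both routes should give the same filled array, while the original X_train still contains its NaN.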
When you call fit(), your imputer object saves the values that were fit. When you later call transform() on your test data, those saved values are used for imputation.
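To make the "saved value" concrete, here is a small sketch, again assuming sklearn.impute.SimpleImputer, which exposes what it learned through its statistics_ attribute. The arrays are invented for the example:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data: the test set has a very different scale on purpose.
X_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)

# The median learned from the training data is stored on the object.
learned_median = imputer.statistics_[0]

# transform() reuses that stored value; it does not look at the test median.
X_test_filled = imputer.transform(X_test)

print(learned_median)
print(X_test_filled.ravel())
```

The missing test value is filled with the training median, not with anything computed from X_test.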
Going back to your example: you use sklearn.preprocessing.LabelEncoder to convert strings to integers. You call fit() and then transform() (or fit_transform()) on your training data to change the strings to integers. Your test data needs the same string-to-integer mapping, so you reuse the already fitted LabelEncoder object and only call transform(), as the object has already been fit (or parameterized) on your training data.
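A rough sketch of that workflow follows. The label lists are hypothetical stand-ins for your own training and test columns:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels standing in for your real columns.
train_labels = ["cat", "dog", "cat", "bird"]
test_labels = ["dog", "bird"]

encoder = LabelEncoder()

# Learn the string -> integer mapping and apply it to the training data.
train_encoded = encoder.fit_transform(train_labels)

# Reuse the already fitted encoder on the test data: transform() only.
test_encoded = encoder.transform(test_labels)

print(encoder.classes_)  # the learned classes, in sorted order
print(train_encoded)
print(test_encoded)
```

Calling fit() again on the test data would learn a new mapping, so the same string could end up with a different integer than in training. Note also that transform() raises a ValueError if the test data contains a label that was not seen during fit().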