For transformers like LabelEncoder and SimpleImputer from scikit-learn, why do we use fit_transform on the X_train DataFrame but only transform on the X_valid DataFrame?

For example:

for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])

What is the difference between the two in terms of how they work?

Tomerikoo
NatMargo

2 Answers


If you want to use an imputer to fill missing values in your training data with the median, you first need to calculate what that median is. This is what happens when you call fit().

At that point you have the median, but you haven't altered your dataset. To do that you need to change (or transform) the dataset, which is what happens when you call transform(). Often you want to calculate the median and immediately use it to replace NaNs or some other missing-value marker, so fit_transform() does both steps in one go for convenience.
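To make the split between the two steps concrete, here is a minimal pure-Python sketch. ToyMedianImputer is a hypothetical stand-in written for illustration only; it is not scikit-learn's implementation, it just mimics the same fit / transform / fit_transform shape.

```python
import math
from statistics import median


class ToyMedianImputer:
    """Hypothetical toy class, not scikit-learn's SimpleImputer."""

    def fit(self, values):
        # Learn and store the median of the non-missing values.
        self.median_ = median(v for v in values if not math.isnan(v))
        return self

    def transform(self, values):
        # Apply the stored median; nothing new is learned here.
        return [self.median_ if math.isnan(v) else v for v in values]

    def fit_transform(self, values):
        # Convenience: fit, then transform the same data.
        return self.fit(values).transform(values)


toy = ToyMedianImputer()
train_out = toy.fit_transform([1.0, float("nan"), 3.0, 8.0])
valid_out = toy.transform([float("nan"), 5.0])  # reuses the training median
```

The missing value in the second list is filled with the median learned from the first list, because transform() only reads the stored state.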

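When you call fit(), your imputer object saves the values that were fit. When you later call transform() on your test data, those saved values are used for imputation, so the test data is filled using statistics learned from the training data only.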
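Here is a small sketch of the same idea with scikit-learn's SimpleImputer. The arrays are made-up example data, and the variable names are only illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: one numeric column with a missing value in each split.
X_train = np.array([[1.0], [2.0], [np.nan], [9.0]])
X_valid = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")

# fit_transform: learn the median from X_train, then fill X_train with it.
train_filled = imputer.fit_transform(X_train)

# transform: fill X_valid with the median already learned from X_train.
valid_filled = imputer.transform(X_valid)

# The fitted value is stored on the object, one entry per column.
learned_median = imputer.statistics_[0]
```

Note that the 100.0 in X_valid has no influence on the fill value, since the imputer was never fitted on the validation data.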

Going back to your example: you use sklearn.preprocessing.LabelEncoder to convert strings to integers. You call fit() and then transform() (or just fit_transform()) on your training data to change the strings to integers. Your test data then needs the same string-to-integer mapping, so you reuse the already fitted LabelEncoder object and only call transform(), because the object has already been fit (or parameterized) on your training data.
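A short sketch of what the fitted LabelEncoder stores, using made-up color strings as example data:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up example data.
train_colors = ["red", "green", "blue", "green"]
valid_colors = ["blue", "red"]

encoder = LabelEncoder()

# fit_transform learns the sorted set of classes and encodes the training data.
train_codes = encoder.fit_transform(train_colors)

# transform reuses that learned mapping for the validation data.
valid_codes = encoder.transform(valid_colors)

# classes_ holds the learned labels; the position of each label is its integer code.
learned_classes = list(encoder.classes_)
```

One thing to keep in mind: transform() raises a ValueError if the validation data contains a label that was not present when the encoder was fitted.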

Jason

label_encoder.fit gets the label encoder ready: it learns the mapping but does not return any transformed data (it returns the fitted encoder itself). You can then apply it with label_encoder.transform(X). label_encoder.fit_transform, on the other hand, gets the encoder ready and then returns the transformed output. In other words:

label_X_train[col] = label_encoder.fit_transform(X_train[col])

is the same as

label_encoder.fit(X_train[col])
label_X_train[col] = label_encoder.transform(X_train[col])

For the validation dataset, you don't want to fit the label encoder again (it is already fitted and ready, and refitting would replace the mapping learned from the training data), so you just use transform.
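To illustrate why refitting on the validation data is a problem, here is a small sketch with made-up labels. It compares reusing the encoder fitted on the training data with fitting a fresh encoder on the validation data.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up example data; the validation set happens to lack "cat".
train = ["cat", "dog", "fish"]
valid = ["dog", "fish"]

# Reusing the encoder fitted on the training data keeps codes consistent.
fitted_on_train = LabelEncoder().fit(train)
consistent = fitted_on_train.transform(valid)

# Fitting a new encoder on the validation data learns a different mapping.
refit_on_valid = LabelEncoder()
inconsistent = refit_on_valid.fit_transform(valid)
```

With the training-fitted encoder, "dog" keeps the same code it had in the training data. With the refitted encoder, "dog" gets a different integer, so a model trained on the original codes would misread the validation column.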

Reza