
How does the fit() method work in sklearn.preprocessing when using the Imputer class? What exactly does fit() do in the background, and why is it necessary in the code below? I see fitting everywhere: what is being fitted to what, why, and how?

from sklearn.preprocessing import Imputer
impt = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
impt = impt.fit(X[:,1:3])
X[:,1:3] = impt.transform(X[:,1:3])
Kapil
  • This question is too broad for Stack Overflow, sorry – PV8 Jul 02 '19 at 08:59
  • Possible duplicate of [What does "fit" method in scikit-learn do?](https://stackoverflow.com/questions/45704226/what-does-fit-method-in-scikit-learn-do) – Dan Jul 02 '19 at 09:08

3 Answers


The idea is to 'fit' your pre-processing on your training data only (as you would your model). Fitting learns some state; for the imputer, this might be the mean of each feature. Then, when you transform your test / validation data, that stored state (i.e. the mean in this case) is used to impute the new, unseen data. This design makes it easy to avoid data leaks. Consider what happens if you impute on your entire dataset: the mean used for the imputation now includes information from your supposedly unseen test data. This is a data leak, and your test data is no longer truly unseen. Scikit-learn uses the fit / transform pattern to mitigate this common pitfall in machine learning.
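A minimal sketch of the train-only fit described above. It assumes a recent scikit-learn, where `SimpleImputer` from `sklearn.impute` takes the place of the `Imputer` class used in the question (that class was removed in later releases). The arrays are made-up illustration data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: one feature, NaN marks a missing value.
X_train = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit() only learns state: the per-column mean of the training data.
imputer.fit(X_train)
train_mean = imputer.statistics_[0]  # mean of the observed values 1, 2, 3

# transform() reuses that stored mean; the test rows never influence it.
X_test_imputed = imputer.transform(X_test)
print(train_mean, X_test_imputed)
```

The value 10.0 in the test set has no effect on the imputed value, which is the point: had the imputer been fitted on train and test together, the mean would have been pulled towards the test data.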

Furthermore, because ALL sklearn transformers and estimators use this fit API, you can chain them together in a pipeline. That makes it easy to redo all of your pre-processing on each fold of a k-fold cross-validation, which would otherwise be fiddly and error-prone.
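As a rough sketch of the pipeline idea, assuming a recent scikit-learn with `SimpleImputer` standing in for the older `Imputer`: the data, the choice of `LogisticRegression`, and the step names "impute" and "model" are all illustrative assumptions, not part of the original question.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 30 rows, 2 features, balanced binary labels.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 15)
X = rng.normal(size=(30, 2)) + y[:, None]
X[::7, 0] = np.nan  # knock out a few values in the first column

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LogisticRegression()),
])

# For each fold, the whole pipeline is fitted on the training part only,
# so the imputation mean is never computed from the held-out rows.
scores = cross_val_score(pipe, X, y, cv=3)
print(scores)
```

Passing the pipeline, rather than pre-imputed data, to `cross_val_score` is what keeps the imputation inside each fold.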

Dan

Imputer(missing_values = "NaN", strategy = "mean", axis = 0)

The line above creates an Imputer object that will replace missing values, denoted as NaN, with the mean of the column they appear in (axis = 0).

impt = impt.fit(X[:,1:3])

To do that, it needs some data from which to calculate the mean that will replace the missing values. This is done by the fit method, which calculates whatever values are needed (the mean in this case) from the data it is given. This step is usually called the training phase.

impt.transform(X[:,1:3])

Once the values are calculated, they can be applied to any new data presented to the imputer. In this case, it replaces the missing entries with the mean calculated in the fit method. This is done via the transform method.

Sometimes one might want to run fit and transform on the same data. In such cases, instead of calling fit followed by transform, we can use the fit_transform method.

X[:,1:3] = impt.fit_transform(X[:,1:3])
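To make the "background" work concrete, here is a toy re-implementation of the three methods in plain Python. `ToyMeanImputer` is a hypothetical class written for illustration only; it is not scikit-learn's actual code, just a sketch of the same fit / transform idea.

```python
import math


class ToyMeanImputer:
    """Hypothetical stand-in for illustration; not scikit-learn's code."""

    def fit(self, rows):
        # Learn one mean per column, ignoring missing (NaN) entries.
        n_cols = len(rows[0])
        self.means_ = []
        for j in range(n_cols):
            present = [row[j] for row in rows if not math.isnan(row[j])]
            self.means_.append(sum(present) / len(present))
        return self  # returning self is why `impt = impt.fit(...)` works

    def transform(self, rows):
        # Replace each NaN with the mean stored for its column.
        return [
            [self.means_[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in rows
        ]

    def fit_transform(self, rows):
        # Convenience: fit on the data, then transform that same data.
        return self.fit(rows).transform(rows)


nan = float("nan")
train = [[1.0, 10.0], [nan, 20.0], [3.0, nan]]

imp = ToyMeanImputer().fit(train)
filled = imp.transform([[nan, nan]])  # new data, filled with the training means
print(imp.means_, filled)
```

Nothing is modified by `fit`; it only stores the column means (2.0 and 15.0 here) on the object, and `transform` is the step that actually fills in values.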

mujjiga

Well, the aim of "fit" in the preprocessing stage is to compute the necessary values (like the min and max of each variable). With these values, scikit-learn can then preprocess your data, which it could not do before. It is also useful because you can reuse your fitted preprocessor object later.

You can also use fit_transform if you would like to do these two steps in one.
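The min / max case can be sketched with `MinMaxScaler` from `sklearn.preprocessing`. The choice of scaler and the numbers are assumptions made for illustration; the answer above does not name a specific class.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training data with a single feature.
X_train = np.array([[0.0], [4.0], [10.0]])

# fit() computes and stores the per-feature min and max.
scaler = MinMaxScaler().fit(X_train)
learned_min = scaler.data_min_[0]
learned_max = scaler.data_max_[0]

# The fitted object can be reused later on data it has never seen:
# (5 - 0) / (10 - 0) = 0.5 with the default feature range of (0, 1).
X_new_scaled = scaler.transform(np.array([[5.0]]))
print(learned_min, learned_max, X_new_scaled)
```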

Simon Delecourt