In order to do proper CV, it is advisable to use pipelines so that the same transformations can be applied to each fold in the CV. I can define custom transformations either by using `sklearn.preprocessing.FunctionTransformer` or by subclassing `sklearn.base.TransformerMixin`. Which one is the recommended approach? Why?

2 Answers
Well, it is totally up to you; both will achieve more or less the same results, only the way you write the code differs.
For instance, when using `sklearn.preprocessing.FunctionTransformer`, you can simply define the function you want to use and call it directly like this (code from the official documentation):
```python
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def all_but_first_column(X):
    return X[:, 1:]


def drop_first_component(X, y):
    """
    Create a pipeline with PCA and the column selector and use it to
    transform the dataset.
    """
    pipeline = make_pipeline(PCA(), FunctionTransformer(all_but_first_column))
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    pipeline.fit(X_train, y_train)
    return pipeline.transform(X_test), y_test
```
On the other hand, when subclassing `sklearn.base.TransformerMixin`, you have to define the whole class along with its `fit` and `transform` methods. So you will have to create a class like this (example code taken from this blog post):
```python
import numpy as np
from sklearn.base import TransformerMixin


class FunctionFeaturizer(TransformerMixin):
    def __init__(self, *featurizers):
        self.featurizers = featurizers

    def fit(self, X, y=None):
        # Nothing to learn here, so fit is a no-op
        return self

    def transform(self, X):
        # Do transformations and return, e.g. apply each featurizer
        # and stack the results as feature columns
        return np.column_stack([f(X) for f in self.featurizers])
```
So as you can see, `TransformerMixin` gives you more flexibility than `FunctionTransformer` with regard to the transform step. You can apply multiple transformations, or a partial transformation depending on the values, and so on. For example, you might want to take the log of the first 50 rows but the inverse log (i.e. exp) of the next 50. You can easily define your `transform` method to deal with the data selectively, as in the sketch below.
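A minimal sketch of such a selective transform, assuming the log/inverse-log split described above (the class name and the `split` parameter are illustrative, not from the original post):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SelectiveLogTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: log for the first `split` rows,
    inverse log (exp) for the remaining rows."""

    def __init__(self, split=50):
        self.split = split

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        out = X.copy()
        # np.log assumes the first `split` rows are strictly positive
        out[:self.split] = np.log(out[:self.split])
        out[self.split:] = np.exp(out[self.split:])
        return out
```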
If you just want to use a function directly as it is, use `sklearn.preprocessing.FunctionTransformer`; if you want to do more involved or complex transformations, I would suggest subclassing `sklearn.base.TransformerMixin`.

- Don't you have to inherit from both `BaseEstimator` and `TransformerMixin` to make `FunctionFeaturizer` work properly? – actual_panda Apr 23 '20 at 06:52
The key difference between `FunctionTransformer` and a subclass of `TransformerMixin` is that with the latter, your custom transformer can *learn* by applying the `fit` method.
E.g. the `StandardScaler` learns the means and standard deviations of the columns during the `fit` method, and in the `transform` method these attributes are used for the transformation. This cannot be achieved by a simple `FunctionTransformer`, at least not in a canonical way, as you would somehow have to pass the training set to it.
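A minimal sketch of what such a stateful transformer looks like (the class name is made up; it mimics the core idea of `StandardScaler`):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MyStandardizer(BaseEstimator, TransformerMixin):
    """Toy standardizer: statistics are learned in fit
    and reused, unchanged, in transform."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned from the training data only
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)  # assumes no constant columns
        return self

    def transform(self, X):
        # Uses the statistics learned during fit,
        # not those of the data passed in here
        return (np.asarray(X, dtype=float) - self.mean_) / self.std_
```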
This possibility to learn is in fact the reason to use custom transformers and pipelines: if you just apply an ordinary stateless function via a `FunctionTransformer`, nothing is gained in the cross-validation process. It makes no difference whether you transform once before the cross validation or in each step of the cross validation (except that the latter will take more time).
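To illustrate, a sketch of putting a stateful transformer into cross validation (reusing the hypothetical `MyStandardizer` from above; the dataset and classifier are just placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# The pipeline is refit on the training folds in every CV split, so the
# standardizer's statistics never see the corresponding validation fold.
pipe = make_pipeline(MyStandardizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```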

- This should be the accepted answer. `FunctionTransformer` is only indicated for stateless transformations. – Tulio Casagrande Feb 24 '21 at 18:26