Custom Transformers in Sklearn Pipeline do not work as expected

Question

I am working on an ML project using sklearn. I have written a few custom transformers, described below:

DateTimeTransformer - Extracts day, month, year, hour, minute, and second (6 new columns); applied to Arrival Time

KBinTransformer - Bins a continuous column into categories [n_bins=3, encode='ordinal', strategy='uniform'] (1 new column); applied to Age

I have a DataFrame like below:

Name (object), class (category), Age (int), Arrival Time (datetime)
-------------------------------------------------------------------
foo          | A               |  44       | 20/7/2023 4:15:2 
bar          | B               |  34       | 10/7/2023 2:10:5

df = pd.DataFrame() #  Contains above data in df

I have created a pipeline as below:

# Pipeline expects a list of (name, transformer) tuples
steps = [
    ("date_time", DateTimeTransformer()),
    ("k_bin", KBinTransformer()),
]

pipe = Pipeline(steps=steps)

pipe.fit(X=df)
pipe.transform(X=df)

The issue appears when I put both steps (date_time and k_bin) in the pipeline and run it. In the output, DateTimeTransformer produces 12 new columns (day, month, year, hour, minute, second, day, month, year, hour, minute, second), which is wrong because I expect 6, and KBinTransformer produces 1 new column.

I tried reversing the steps:

steps = [
    ("k_bin", KBinTransformer()),
    ("date_time", DateTimeTransformer()),
]

Now KBinTransformer produces 2 new columns (age, age), which is wrong because I expect 1, and DateTimeTransformer produces 6 new columns.

What seems to be happening is that, during fit(), the input to each transformer is the output of the previous one (the newly created columns plus the untouched original columns), and the later call to transform() creates those columns again, so the final output contains duplicates.

If I keep only one transformer in the pipeline, the output is correct: DateTimeTransformer alone gives 6 new columns, and KBinTransformer alone gives 1 new column.

What am I missing in how I use the pipeline?

It is hard to tell without details, please provide your definitions of your custom transformers — DataJanitor, Jul 21 '23 at 07:18
Hi DataJanitor, thanks. `ColumnTransformer` to the rescue. I was not aware of it; I just found it and am exploring it, and it looks like it will help. — winter, Jul 21 '23 at 08:00

DataJanitor · Accepted Answer · 2023-07-21T09:50:29.680

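If you use your transformers as steps in a pipeline, they are applied one after the other: each step receives the full output of the previous step, so every transformer sees all columns, including any that an earlier step created.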
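To illustrate this chaining, here is a minimal sketch. `AddColumn` is a hypothetical transformer invented for this example (it is not one of the transformers from the question); it appends one constant column and records which columns it was given during `fit`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class AddColumn(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends one constant column named `name`."""

    def __init__(self, name="new"):
        self.name = name

    def fit(self, X, y=None):
        # Record the columns this step actually receives.
        self.seen_columns_ = list(X.columns)
        return self

    def transform(self, X):
        X = X.copy()
        X[self.name] = 0
        return X


df = pd.DataFrame({"Age": [44, 34]})
pipe = Pipeline(steps=[("first", AddColumn("a")), ("second", AddColumn("b"))])
out = pipe.fit_transform(df)

first_seen = pipe.named_steps["first"].seen_columns_
second_seen = pipe.named_steps["second"].seen_columns_
print(first_seen)   # columns given to the first step
print(second_seen)  # columns given to the second step
```

The second step is fitted on the output of the first, so it also sees column `a`. A transformer that processes every column it receives will therefore also process columns added by earlier steps.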

I suspect you do not want your transformers as sequential steps, but inside a ColumnTransformer, so that each one transforms only the columns of a matching dtype. You can use make_column_selector to select the columns you want:

import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector

# Each transformer receives only the columns matched by its selector.
ct = ColumnTransformer(
    transformers=[
        # datetime columns (e.g. Arrival Time)
        ('datetime', DateTimeTransformer(), make_column_selector(dtype_include=np.datetime64)),
        # numeric columns (e.g. Age)
        ('kbin', KBinTransformer(), make_column_selector(dtype_include=np.number)),
    ],
    # columns not selected above (Name, class) are passed through unchanged
    remainder='passthrough')

df_transformed = ct.fit_transform(df)
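Since the question does not show the custom transformers, here is a self-contained sketch of the same idea under stated assumptions. The `DateTimeTransformer` below is a hypothetical stand-in written for this example, and sklearn's built-in `KBinsDiscretizer` is used in place of the asker's `KBinTransformer`, with the parameters listed in the question. The two sample rows are taken from the question's table, with the timestamps rewritten in ISO format.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import KBinsDiscretizer


class DateTimeTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: splits each datetime column into six parts."""

    PARTS = ("day", "month", "year", "hour", "minute", "second")

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        out = {}
        for col in X.columns:
            for part in self.PARTS:
                out[f"{col}_{part}"] = getattr(X[col].dt, part)
        return pd.DataFrame(out, index=X.index)


df = pd.DataFrame({
    "Name": ["foo", "bar"],
    "class": pd.Categorical(["A", "B"]),
    "Age": [44, 34],
    "Arrival Time": pd.to_datetime(["2023-07-20 04:15:02", "2023-07-10 02:10:05"]),
})

ct = ColumnTransformer(
    transformers=[
        # Only datetime columns reach the datetime transformer.
        ("datetime", DateTimeTransformer(),
         make_column_selector(dtype_include=np.datetime64)),
        # Only numeric columns reach the binning transformer.
        ("kbin", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform"),
         make_column_selector(dtype_include=np.number)),
    ],
    remainder="passthrough",  # Name and class are appended unchanged
)

out = ct.fit_transform(df)
print(out.shape)
```

With this data the result should have 6 date parts + 1 binned Age + 2 passthrough columns, with no duplicates, because each transformer is applied to the original DataFrame rather than to another transformer's output.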
