I'm learning to use pipelines because they look cleaner, and I'm practicing on the Tabular Playground competition on Kaggle.

I'm trying to follow a pretty simple pipeline where I use a FunctionTransformer to add a new column to the dataframe, apply ordinal encoding, and finally fit a LinearRegression model on the data.

Here is the code:

def weekfunc(df):
    print(df)
    df = pd.to_datetime(df)
    df['weekend'] = df.dt.weekday
    df['weekend'].replace(range(5), 0, inplace = True)
    df['weekend'].replace([5,6], 1, inplace = True)

get_weekend = FunctionTransformer(weekfunc)
 
col_trans = ColumnTransformer([
    ('weekend transform', get_weekend,['date']),
    ('label encoding', OrdinalEncoder(), ['country', 'store', 'product'])
])
 
pipe = Pipeline([
    ('label endoer', col_trans),
    ('regression', LinearRegression())
])
 
 
pipe.fit(X_train,y_train)

But the code breaks on the first step (FunctionTransformer) and gives me the following error:

to assemble mappings requires at least that [year, month, day] be specified: 
[day,month,year] is missing

which is weird since I can print inside the function being executed which shows it is in datetime format. Even get_weekend.transform(X_train['date']) works as intended. But it doesn't seem to work when all the steps are joined.

Flavia Giammarino
default-303

1 Answer

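FunctionTransformer creates a scikit-learn compatible transformer from a custom function. The function must return its result so that later steps can use it. The problem with your code is that weekfunc takes in a DataFrame and returns nothing. In addition, ColumnTransformer passes the selection ['date'] to the function as a one-column DataFrame rather than a Series, and pd.to_datetime on a DataFrame tries to assemble dates from year, month, and day columns, which is where the error message comes from. Calling get_weekend.transform(X_train['date']) passes a Series, so that error does not appear there.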
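Both failure modes can be reproduced in isolation. The snippet below is a minimal sketch, assuming made-up sample dates; the helper name no_return is hypothetical and only stands in for a function that forgets its return statement.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical sample data; the dates are made up for illustration.
X = pd.DataFrame({"date": ["2022-01-01", "2022-01-03"]})

# A Series converts fine: this is what X['date'] hands to the function.
as_series = pd.to_datetime(X["date"])

# A one-column DataFrame is what ColumnTransformer hands over for ['date'].
# pd.to_datetime then looks for year/month/day columns and raises.
try:
    pd.to_datetime(X[["date"]])
    raised = False
    message = ""
except ValueError as err:
    raised = True
    message = str(err)


def no_return(df):
    # Hypothetical helper: does some work but has no return statement.
    df = df.copy()


# transform() simply forwards whatever the function returned, here None.
result = FunctionTransformer(no_return).transform(X[["date"]])
```

So the function has to index the column itself (for example df["date"]) and explicitly return a 2D result.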

The example below uses a corrected weekfunc inside a pipeline:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression
from datetime import datetime
import pandas as pd

X_train = pd.DataFrame(
    {
        "date": [datetime.today()],
        "country": ["ch"],
        "store": ["Gamestop"],
        "product": ["Xbox"],
    }
)
y_train = pd.Series([1])


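def weekfunc(df):
    # Receives a one-column DataFrame from ColumnTransformer and returns
    # a one-column boolean DataFrame (True on Saturday and Sunday).
    return (df["date"].dt.weekday >= 5).to_frame()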


get_weekend = FunctionTransformer(weekfunc)

col_trans = ColumnTransformer(
    [
        ("weekend transform", get_weekend, ["date"]),
        ("label encoding", OrdinalEncoder(), ["country", "store", "product"]),
    ]
)

pipe = Pipeline([("preprocessing", col_trans), ("regression", LinearRegression())])


pipe.fit(X_train, y_train)

In addition, scikit-learn pipelines do not only look cleaner. They also bundle preprocessing and modeling into a single object, which is very helpful when deploying a model to production.
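As a rough illustration of that point, the sketch below fits a pipeline on raw rows and then predicts on new raw rows with a single call. All data values, the target numbers, and the step names are made up for the example; only the overall structure follows the code above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder


def weekfunc(df):
    # One-column boolean DataFrame: True on Saturday and Sunday.
    return (df["date"].dt.weekday >= 5).to_frame()


# Hypothetical training data, invented for illustration.
train = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2022-01-01", "2022-01-03", "2022-01-08", "2022-01-10"]
        ),
        "country": ["ch", "fr", "ch", "fr"],
        "store": ["A", "B", "A", "B"],
        "product": ["x", "y", "y", "x"],
    }
)
target = pd.Series([10.0, 4.0, 12.0, 5.0])

col_trans = ColumnTransformer(
    [
        ("weekend", FunctionTransformer(weekfunc), ["date"]),
        ("encoding", OrdinalEncoder(), ["country", "store", "product"]),
    ]
)
pipe = Pipeline([("preprocessing", col_trans), ("regression", LinearRegression())])
pipe.fit(train, target)

# New raw rows (categories already seen during training).
new_rows = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-01-15", "2022-01-17"]),
        "country": ["ch", "fr"],
        "store": ["A", "B"],
        "product": ["x", "y"],
    }
)

# One call applies the preprocessing and the model.
preds = pipe.predict(new_rows)

# The same result, done by hand in two steps.
features = pipe.named_steps["preprocessing"].transform(new_rows)
manual = pipe.named_steps["regression"].predict(features)
```

Because the fitted pipe carries its own preprocessing, serving code only needs to hand it raw rows in the training layout.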

Antoine Dubuis
  • Will the `fit` method of `pipe` invoke the transform method of `get_weekend`? – MSS Nov 25 '22 at 10:04
  • Yes, the `fit()` function of `get_weekend` will be called. But since it is a `FunctionTransformer`, which is used to perform stateless transformations, its `fit()` function will do nothing. – Antoine Dubuis Nov 28 '22 at 06:42
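To see this behaviour directly, here is a small sketch that records each call to the wrapped function. The calls list and the sample dates are hypothetical additions for illustration. A pipeline's fit calls fit_transform on its intermediate steps, so the same pattern applies inside pipe.fit.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

calls = []  # hypothetical call log, one entry per invocation of weekfunc


def weekfunc(df):
    calls.append(len(df))
    return (df["date"].dt.weekday >= 5).to_frame()


# Made-up dates: a Saturday and a Monday.
X = pd.DataFrame({"date": pd.to_datetime(["2022-01-01", "2022-01-03"])})
get_weekend = FunctionTransformer(weekfunc)

# fit() alone does not run the function (no inverse_func is set here).
get_weekend.fit(X)
calls_after_fit = len(calls)

# fit_transform() runs fit() and then transform(), which calls the function.
out = get_weekend.fit_transform(X)
calls_after_fit_transform = len(calls)
```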