How to implement a function through scikit FunctionTransformer() that refers to two columns of a data frame ('kw_args' argument?)

Question

While working on my submission for the famous Kaggle Titanic dataset (890 rows/11 columns), I would like to execute all of my 'Feature Engineering' steps within one scikit pipeline. However, I could find hardly any online examples that demonstrate how to use the scikit FunctionTransformer() to execute slightly more complex custom functions, especially functions that refer to more than one column of the dataset.

In my concrete example, I would like to replace NaN values in the column 'Age' depending on the passenger class (column 'Pclass'). Possible passenger classes are 1, 2 or 3, and the corresponding ages that should replace the NaN values are 38, 30 and 25. My current code looks like this:

def impute_age_class(df, column_1, column_2):
    for i in range(len(df)):
        if np.isnan(df[column_1].iloc[i]):
            if df[column_2].iloc[i] == 1:
                df[column_1].iloc[i] = 38
            elif df[column_2].iloc[i] == 2:
                df[column_1].iloc[i] = 30
            else:
                df[column_1].iloc[i] = 25
    return df

age_transformers = [(
    "impute_age_class",
    FunctionTransformer(impute_age_class, validate=False,
                        kw_args={'column_1': 'Age', 'column_2': 'Pclass'}),
    ["Age", "Pclass"],
)]

The code appears to run, and I get a slightly better accuracy score with my logreg model, but I also get the warnings shown in this screenshot:

[Image: screenshot of the warning message]

I would be very thankful if you could give me any hints on whether the syntax of my code could be improved in order to avoid these warnings and ensure correct execution.

score 0 · Answer 1 · answered Nov 11 '22 at 22:46

That warning is very common, and worth reading up on. But it's also not great to be looping over the rows of a dataframe. You can use pandas's own fillna for this:

def impute_age_class(df, fillme, groupby):
    df = df.copy()
    df.loc[:, fillme] = df[fillme].fillna(
        value=df[groupby].map(
            {1: 38, 2: 30, 3: 25})
        )
    return df

tfmr = FunctionTransformer(
    impute_age_class,
    validate=False,
    # Column names as they appear in the question's DataFrame.
    kw_args={'fillme': 'Age', 'groupby': 'Pclass'}
)
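As a quick sanity check, the transformer can be tried on a small frame before it goes into a pipeline. The sketch below is self-contained and uses made-up toy data (not the Titanic file), with the question's column names 'Age' and 'Pclass':

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def impute_age_class(df, fillme, groupby):
    # Work on a copy so the caller's frame is left untouched.
    df = df.copy()
    # fillna aligns the mapped Series on the index, so each NaN gets
    # the value that belongs to its own row's class.
    df.loc[:, fillme] = df[fillme].fillna(
        value=df[groupby].map({1: 38, 2: 30, 3: 25})
    )
    return df


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, np.nan],
    "Pclass": [3, 1, 2, 3],
})

tfmr = FunctionTransformer(
    impute_age_class,
    validate=False,
    kw_args={"fillme": "Age", "groupby": "Pclass"},
)

imputed = tfmr.fit_transform(toy)
print(imputed["Age"].tolist())
```

Because the function copies its input, `toy` itself still contains its NaN values afterwards; only the returned frame is filled.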

It's a little unusual to have the parameters for the two column names when you are hard-coding the mapping inside the function. And if you didn't have the mapping already in mind, it'd be better to learn it at fit time and then transform train and test sets with that mapping: see SimpleImputer with groupby and https://datascience.stackexchange.com/q/71856/55122.
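One way to learn the mapping at fit time is a small custom transformer. The sketch below is only an illustration: the class name `GroupMedianImputer` is hypothetical, and the choice of the per-group median as the fill value is an assumption (any other statistic could be learned the same way). The toy data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupMedianImputer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fill NaNs in `target` with a per-group median."""

    def __init__(self, target="Age", group="Pclass"):
        self.target = target
        self.group = group

    def fit(self, X, y=None):
        # Learn one fill value per group from the training data only.
        self.medians_ = X.groupby(self.group)[self.target].median()
        # Overall median as a fallback for groups unseen during fit.
        self.fallback_ = X[self.target].median()
        return self

    def transform(self, X):
        X = X.copy()
        fill = X[self.group].map(self.medians_).fillna(self.fallback_)
        X[self.target] = X[self.target].fillna(fill)
        return X


# Toy data, invented for illustration only.
train = pd.DataFrame({
    "Age": [40.0, 36.0, np.nan, 30.0, 20.0, 30.0],
    "Pclass": [1, 1, 1, 2, 3, 3],
})
test = pd.DataFrame({"Age": [np.nan, np.nan, 50.0], "Pclass": [1, 3, 2]})

imputer = GroupMedianImputer().fit(train)
filled = imputer.transform(test)
print(filled["Age"].tolist())
```

The test rows are filled with values learned from `train`, so no information from the test set leaks into the imputation.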
