pass multiple datasets/arguments to FunctionTransformer in Pipeline

Question

I am trying to build a pipeline of transformations. In the example below, undersample_train_set accepts three parameters: X is a dataframe of features, y is a np.array of labels, and strategy_count is a dictionary of counts for each label. SMOTE_train_set accepts the same parameters plus cat_cols, an array marking the categorical features, and knn, the number of nearest neighbours (knn=1 here).

I want to put these steps into a Pipeline. Before that, I wrap each function in a FunctionTransformer, passing the extra arguments through kw_args, and then I add the wrapped transformers to the pipeline as shown.

However, the pipeline raises this error: TypeError: undersample_train_set() missing 1 required positional argument: 'y'

I have been reading documentation and examples, including the documentation and Stack Overflow, and every example I found uses functions that take only X, whereas mine take X and y. Is that the reason the Pipeline throws the error? I tested the FunctionTransformer with fit on my X and y and it worked fine with the expected results, but it won't run in the Pipeline. Any hint as to where I am going wrong?

def undersample_train_set(X, y, strategy_count):

    under = RandomUnderSampler(sampling_strategy=strategy_count, random_state=42)
    X_resample, y_resample = under.fit_resample(X, y)
    return X_resample, y_resample

def SMOTE_train_set(X, y, cat_cols, strategy_count, knn):

    smote_nc = SMOTENC(categorical_features=cat_cols,
                        sampling_strategy=strategy_count,
                        random_state=1,
                        k_neighbors=knn)
    X_resample, y_resample = smote_nc.fit_resample(X, y)
     
    return X_resample, y_resample


transformer_under = FunctionTransformer(undersample_train_set,
                                        kw_args={'strategy_count': under_strategy_count})


transformer_SMOTE = FunctionTransformer(SMOTE_train_set,
                                        kw_args={'cat_cols': cat_cols_bool_arr,
                                        'strategy_count': SMOTE_strategy_count,
                                        'knn': 1})

# Pipeline
pipe_transformations = Pipeline([('under', transformer_under), ('smote', transformer_SMOTE)]).fit_transform(X, y)

sklearn pipelines won't let you modify `y`, but that's why `imblearn` provides its own `Pipeline` object. // Why the function wrappers around the `imblearn` transformers, and those inside `FunctionTransformer`s, as opposed to directly using the transformers? — Ben Reiniger, Sep 16 '22 at 16:50
I guess we posted at the same time. Yes, you are correct. I have the final code posted below. — baharak Al, Sep 16 '22 at 17:16

score 0 · Answer 1 · answered Sep 16 '22 at 17:14

I found that I needed to import Pipeline from imblearn instead of scikit-learn, thanks to this git discussion:

from sklearn.pipeline import Pipeline

vs

from imblearn.pipeline import Pipeline
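
As far as I understand it, the original error comes from FunctionTransformer itself: its transform step calls the wrapped function with X and the kw_args only, so y never reaches the function. Here is a minimal sketch of that behaviour. The function needs_y and its factor argument are made up for illustration and are not part of the question's code.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def needs_y(X, y, factor):
    # Hypothetical function that, like undersample_train_set, requires y.
    return X * factor, y


X = np.arange(6, dtype=float).reshape(3, 2)
y = np.array([0, 1, 0])

transformer = FunctionTransformer(needs_y, kw_args={"factor": 2})

try:
    # fit() accepts y, but transform() only forwards X and kw_args to needs_y.
    transformer.fit_transform(X, y)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The same thing happens inside a scikit-learn Pipeline, since each intermediate step is only asked to transform X.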

Then I modified the code as below:

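from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import RandomUnderSampler

# Use the imblearn samplers directly; no FunctionTransformer wrappers are needed
smote_nc = SMOTENC(categorical_features=cat_cols_bool_arr,
                   sampling_strategy=SMOTE_strategy_count,
                   random_state=1,
                   k_neighbors=1)

under = RandomUnderSampler(sampling_strategy=under_strategy_count)

# fit_resample returns the resampled features and labels
X_resampled, y_resampled = Pipeline([('under', under), ('smote', smote_nc)]).fit_resample(X, y)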
