
I have a pipeline which uses lambda functions:

preprocess_ppl = ColumnTransformer(
    transformers=[
        ('encode', categorical_transformer, make_column_selector(dtype_include=object)),
        ('zero_impute', fill_na_zero_transformer, lambda X: [col for col in fill_zero_cols if col in X.columns] ),
        ('numeric', numeric_transformer, lambda X: [col for col in num_cols if col in X.columns])
    ]
)
pipeline2 = Pipeline(
    steps=[
        ('dropper', drop_cols),
        ('remover',feature_remover),
        ("preprocessor", preprocess_ppl),
        ("estimator", customOLS(sm.OLS))
        ]
)

Basically, the lambda functions select/subset only the columns that are actually present in X. Some columns may be removed by an intermediate step, so it is possible that a column in num_cols no longer exists; hence I use a lambda function to select only the columns that are still present.

The problem is that the lambda functions are not serializable with pickle, and I have to use pickle; I cannot use dill. Is there any other way of writing these lambda functions?

Obiii
  • Define named functions, with `def`. (You may need to make those definitions available when unpickling, I'm not really sure.) – Ben Reiniger Aug 17 '22 at 18:37
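A minimal sketch of the comment's suggestion: instead of a lambda, use a module-level callable class, whose instances pickle cleanly. The class name `PresentColumns` is illustrative, not part of sklearn; any object with a `.columns` attribute (such as a DataFrame) works as input.

```python
import pickle
from types import SimpleNamespace

# Picklable replacement for the lambda selector: lambdas can't be
# pickled, but instances of a named, module-level class can.
class PresentColumns:
    def __init__(self, cols):
        self.cols = list(cols)

    def __call__(self, X):
        # Same logic as the lambda: keep only columns present in X
        return [c for c in self.cols if c in X.columns]

selector = PresentColumns(['a', 'b', 'c'])
restored = pickle.loads(pickle.dumps(selector))  # round-trips fine

# Stand-in for a DataFrame: anything with a .columns attribute
X = SimpleNamespace(columns=['a', 'c', 'd'])
print(restored(X))  # → ['a', 'c']
```

An instance of this class can then be passed to `ColumnTransformer` anywhere the lambda was used, and the whole pipeline stays picklable.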

1 Answer


Don't use lambdas; just pass the list `fill_zero_cols` for 'zero_impute' and `num_cols` for 'numeric'.

After all, your lambda is only checking whether each column name is in `X.columns` before processing. But if you try to process an input with missing features, your model will most likely break anyway.

So your best option is to pre-define a list of columns for each type you want to process. This will give you consistent results.

You have to ensure that your transformers fill the null values. Be extremely careful about how you fill those gaps in the data: the imputation strategy has to be conceptually valid for your problem.

Ensure you pre-process your data (fill the nulls) before you train the model, or within your pipeline using a callable transformer.
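A minimal runnable sketch of this approach, with plain column lists as selectors. The transformers (`SimpleImputer`, `StandardScaler`) and column names are stand-ins for the question's custom transformers, chosen so the example is self-contained:

```python
import pickle
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Fixed, pre-defined column lists instead of lambda selectors
num_cols = ['a', 'b']
fill_zero_cols = ['c']

preprocess_ppl = ColumnTransformer(
    transformers=[
        ('zero_impute', SimpleImputer(strategy='constant', fill_value=0),
         fill_zero_cols),
        ('numeric', StandardScaler(), num_cols),
    ]
)

X = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0], 'c': [None, 5.0]})
out = preprocess_ppl.fit_transform(X)  # one imputed + two scaled columns

# With plain lists as selectors, the fitted transformer pickles cleanly
restored = pickle.loads(pickle.dumps(preprocess_ppl))
```

Because the selectors are ordinary lists rather than lambdas, the whole `ColumnTransformer` (and any pipeline containing it) can be serialized with plain pickle.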

nferreira78
  • Hi, this is what I do not want to do! Let's say num_cols = ['a','b','c','d']; remove_cols can remove any of them based on the data and some logic. If I pass num_cols as-is, it will result in an error saying: A column does not exist in the data – Obiii Aug 12 '22 at 10:43
  • Yes, but what you want to do isn't necessarily the correct thing to do. You have to be extremely careful about how you fill the gaps for data that doesn't exist, and your concept has to be valid. Also, what is `remove_cols`? How are you expecting it to remove columns? If you get that error message, you have to pre-process your data (fill the nulls) before you train the model (and before you make inference) – nferreira78 Aug 12 '22 at 10:54
  • remove_cols is a custom transformer that removes columns having a missing percentage of more than 40%. – Obiii Aug 12 '22 at 11:05
  • You cannot remove columns at will. You either fill the missing values using some criterion (which needs validation), or don't use that column at all. You can only pass a fixed number of features. Using a threshold of 40% is just an arbitrary criterion. – nferreira78 Aug 12 '22 at 12:14
  • "or don't use that column at all" is exactly what Obiii is trying to do. "Using a threshold of 40% is just an arbitrary criterion" may be true, or it may be a well-informed threshold; that starts to be off-topic here, though. – Ben Reiniger Aug 17 '22 at 18:33