I have a scikit-learn pipeline in which I want to encode a categorical feature. The problem is that an earlier step may delete that feature based on some logic, so the encoding step should run only if the feature still exists after removal.
Here is the code I have:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
import statsmodels.api as sm

# drop_cols, feature_remover, fill_na_zero_transformer, numeric_transformer,
# fill_zero_cols, num_cols and customOLS are defined elsewhere in my code.

# Impute missing categories with a constant, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ('categorical_imputer', SimpleImputer(strategy="constant", fill_value='Unknown')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocess_ppl = ColumnTransformer(
    transformers=[
        # Hard-coded column name: this is the step that fails when the column is gone.
        ('categorical', categorical_transformer, ['MARITAL_STATUS']),
        ('zero_impute', fill_na_zero_transformer, lambda X: [col for col in fill_zero_cols if col in X.columns]),
        ('numeric', numeric_transformer, lambda X: [col for col in num_cols if col in X.columns])
    ]
)

pipeline2 = Pipeline(
    steps=[
        ('dropper', drop_cols),
        ('remover', feature_remover),
        ("preprocessor", preprocess_ppl),
        ("estimator", customOLS(sm.OLS))
    ]
)
Sometimes the dropper or remover step removes the MARITAL_STATUS feature, and the pipeline then raises an error saying that the column is not present in the data.
Is there any way to encode the column only when it is still present?
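One approach that may work is to give the categorical step a callable column selector too, as the zero_impute and numeric steps already do. The sketch below assumes the data reaching the ColumnTransformer is a pandas DataFrame. The `existing` helper, the `AGE` column, the `StandardScaler` stand-in for the numeric transformer, and the toy data are all hypothetical and only there to make the example self-contained. As far as I can tell, ColumnTransformer skips a transformer whose selector returns an empty list, so the encoder simply does not run when the column is gone.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def existing(cols):
    """Hypothetical helper: build a selector that keeps only columns still in X."""
    return lambda X: [col for col in cols if col in X.columns]


categorical_transformer = Pipeline(steps=[
    ("categorical_imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

preprocess_ppl = ColumnTransformer(
    transformers=[
        # The selector returns [] when MARITAL_STATUS was removed upstream,
        # in which case this transformer is skipped.
        ("categorical", categorical_transformer, existing(["MARITAL_STATUS"])),
        # StandardScaler is only a stand-in for the real numeric transformer.
        ("numeric", StandardScaler(), existing(["AGE"])),
    ]
)

# Toy data (assumption): one frame with the categorical column, one without.
full = pd.DataFrame({
    "MARITAL_STATUS": pd.Series(["single", "married", "single"], dtype=object),
    "AGE": [30.0, 41.0, 25.0],
})
reduced = full.drop(columns=["MARITAL_STATUS"])

with_cat = preprocess_ppl.fit_transform(full)        # encoder runs
without_cat = preprocess_ppl.fit_transform(reduced)  # encoder is skipped
print(with_cat.shape, without_cat.shape)
```

Note that the selector is evaluated when the ColumnTransformer is fitted, so the data passed at predict time should contain the same columns that survived the dropper and remover steps during fitting.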