Similar questions have been asked before, but this is a particular case, and it seems that sklearn has evolved quite a bit since then (I am using scikit-learn 1.1.2), so I think it is worth a new post.
I created an sklearn Pipeline in which I apply different transformations to numeric and categorical columns, as shown below:
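from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

# Separate numeric columns and categorical columns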
numeric_features = X_train.select_dtypes(exclude=['object']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()
# Define transformer pipelines to be applied to each type of column
# 1. Apply KNNImputer to numeric columns
# 2. Apply OneHotEncoder to categorical columns
num_transform_pipeline = Pipeline(steps = [('imputer', KNNImputer(n_neighbors=1, weights="uniform"))])
cat_transform_pipeline = Pipeline(steps = [('onehotencoding', OneHotEncoder(handle_unknown='ignore', sparse=False))])
# Apply each transformer pipeline to each type of columns
column_transformer = ColumnTransformer(
    transformers=[
        ("num_column_transformer", num_transform_pipeline, numeric_features),
        ("cat_column_transformer", cat_transform_pipeline, categorical_features),
    ],
    verbose_feature_names_out=False,
)
# Define the final pipeline combining column transformers and the regressor
pipeline = Pipeline([('column_transformer', column_transformer),
                     ('regressor', XGBRegressor())])
After loading the pipeline in another script, I am trying to find the categorical columns that are passed to the OneHotEncoder step. In the example above, OneHotEncoder is the first step of cat_transform_pipeline, so there is no previous step on which I could call get_feature_names_out().
However, I found two different ways of getting the list of categorical columns:
- Accessing the last element of the (name, fitted_transformer, columns) tuple for the second transformer of column_transformer returns the categorical columns:
cat_feature_names = pipeline['column_transformer'].transformers_[1][-1]
However, when I try to access the second transformer, cat_column_transformer, by its name:
cat_feature_names = pipeline['column_transformer'].named_transformers_['cat_column_transformer'][-1]
I get the error TypeError: 'OneHotEncoder' object is not iterable
Is there a way to achieve the same result by using the name of the transformer rather than its index?
- Accessing OneHotEncoder's feature_names_in_ attribute does the job and seems to be the easiest method:
cat_feature_names = pipeline['column_transformer'].named_transformers_['cat_column_transformer']['onehotencoding'].feature_names_in_
However, when OneHotEncoder is not the first step of the pipeline, as in the following case where an imputer is defined just before it:
cat_transform_pipeline = Pipeline(steps = [('imputer', SimpleImputer(strategy = 'most_frequent')),
                                           ('onehotencoding', OneHotEncoder(handle_unknown='ignore', sparse=False))])
I get the following error: AttributeError: 'OneHotEncoder' object has no attribute 'feature_names_in_'
The solution in this case is to use get_feature_names_out() on the previous step (the imputer). But that doesn't seem very consistent. Why would the attribute feature_names_in_ cease to exist when OneHotEncoder is preceded by an imputer?