Sklearn Pipeline / OneHotEncoder : consistency in getting categorical features with feature_names_in_ / get_feature_names_out()

Question

Similar questions have been asked before, but this is a particular case, and it seems that sklearn has evolved quite a bit since then (I am using scikit-learn 1.1.2), so I think it is worth a new post.

I created an sklearn Pipeline that applies different transformations to the numeric and categorical columns, as shown below:

# Imports needed by the snippet below
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

# Separate numeric columns and categorical columns (X_train is a pandas DataFrame)
numeric_features = X_train.select_dtypes(exclude=['object']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()

# Define transformer pipelines to be applied to each type of column
# 1. Apply KNNImputer to numeric columns
# 2. Apply OneHotEncoder to categorical columns
# Note: sparse=False is valid in scikit-learn 1.1.2; from 1.2 onward the parameter is named sparse_output
num_transform_pipeline = Pipeline(steps=[('imputer', KNNImputer(n_neighbors=1, weights="uniform"))])
cat_transform_pipeline = Pipeline(steps=[('onehotencoding', OneHotEncoder(handle_unknown='ignore', sparse=False))])

# Apply each transformer pipeline to its type of column
column_transformer = ColumnTransformer(
    transformers=[
        ("num_column_transformer", num_transform_pipeline, numeric_features),
        ("cat_column_transformer", cat_transform_pipeline, categorical_features),
    ],
    verbose_feature_names_out=False,
)

# Define the final pipeline combining the column transformer and the regressor
pipeline = Pipeline([('column_transformer', column_transformer),
                     ('regressor', XGBRegressor())])

After loading the fitted pipeline in another script, I am trying to find the categorical columns that are passed to the OneHotEncoder step. In the example above, OneHotEncoder is the first step of cat_transform_pipeline, so there is no previous step on which I could call get_feature_names_out().

However, I found two different ways of getting the list of categorical columns.

The first is to take the last element of the (name, fitted_transformer, columns) tuple for the second transformer in column_transformer.transformers_, which returns the categorical columns:

cat_feature_names = pipeline['column_transformer'].transformers_[1][-1]

However, when I try to access the second transformer, cat_column_transformer, by its name:

cat_feature_names = pipeline['column_transformer'].named_transformers_['cat_column_transformer'][-1]

I get the error TypeError: 'OneHotEncoder' object is not iterable.

Is there a way to achieve the same result by using the name of the transformer rather than its index?

The second way, accessing OneHotEncoder's feature_names_in_ attribute, does the job and seems to be the easiest method:

cat_feature_names = pipeline['column_transformer'].named_transformers_['cat_column_transformer']['onehotencoding'].feature_names_in_

However, this breaks when OneHotEncoder is not the first step of the pipeline, as in the following case where an imputer is defined just before it:

cat_transform_pipeline = Pipeline(steps = [('imputer', SimpleImputer(strategy = 'most_frequent')),
                                           ('onehotencoding', OneHotEncoder(handle_unknown='ignore', sparse=False))])

I get the following error: AttributeError: 'OneHotEncoder' object has no attribute 'feature_names_in_'

The solution in this case is to use get_feature_names_out() on the previous step (the imputer), but that doesn't seem very consistent. Why would the feature_names_in_ attribute cease to exist when OneHotEncoder is preceded by an imputer?
