
How do I keep track of the columns of the transformed array produced by sklearn.compose.ColumnTransformer? By "keeping track of" I mean that every bit of information required to perform an inverse transform must be available explicitly. This includes at least the following:

  1. What is the source variable of each column in the output array?
  2. If a column of the output array comes from one-hot encoding of a categorical variable, what is that category?
  3. What is the exact imputed value for each variable?
  4. What is the (mean, stdev) used to standardize each numerical variable? (These may differ from direct calculation because of imputed missing values.)

I am using the approach from this answer. My input dataset is likewise a generic pandas.DataFrame with multiple numerical and categorical columns. Yes, that answer can transform the raw dataset, but I lose track of the columns in the output array. I need this information for peer review, report writing, presentations, and further model-building steps. I've been searching for a systematic approach, but with no luck.
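For concreteness, the preprocessor I have in mind follows the same pattern as that answer; the column names below are only the titanic-style illustration used there, so substitute your own:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric columns: impute missing values, then standardize.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Categorical columns: impute missing values, then one-hot encode.
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])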


1 Answer


The answer you mentioned is based on this example in sklearn.

You can get the answers to your first two questions with the following snippet.

import pandas as pd


def get_feature_names(columnTransformer):
    """Recover the output column names of a fitted ColumnTransformer
    whose transformers are Pipelines."""
    output_features = []

    for name, pipe, features in columnTransformer.transformers_:
        if name != 'remainder':
            for i in pipe:
                trans_features = []
                if hasattr(i, 'categories_'):
                    # One-hot encoder: one output name per category
                    # (get_feature_names_out in newer sklearn versions).
                    trans_features.extend(i.get_feature_names(features))
                else:
                    # Imputer, scaler, etc.: column names are unchanged.
                    trans_features = features
            output_features.extend(trans_features)

    return output_features


# Label the transformed array with the recovered names.
pd.DataFrame(preprocessor.fit_transform(X_train),
             columns=get_feature_names(preprocessor))

(Screenshot: the transformed array displayed as a DataFrame with the recovered feature names as column headers.)
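With a titanic-style preprocessor like the sketch in the question (column names here are only illustrative), the recovered names look roughly like:

get_feature_names(preprocessor)
# e.g. ['age', 'fare', 'embarked_C', 'embarked_Q', 'embarked_S', 'sex_female', 'sex_male', ...]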

transformed_cols = get_feature_names(preprocessor)

def get_original_column(col_index):
    # Source variable: the part before the first underscore.
    return transformed_cols[col_index].split('_')[0]

get_original_column(3)
# 'embarked'

get_original_column(0)
# 'age'

def get_category(col_index):
    # One-hot category: the part after the last underscore, if any.
    new_col = transformed_cols[col_index].split('_')
    return 'no category' if len(new_col) < 2 else new_col[-1]

print(get_category(3))
# 'Q'

print(get_category(0))
# 'no category'
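Putting the two helpers together gives a full index-to-source mapping for the transformed array:

for idx in range(len(transformed_cols)):
    # output column index, source variable, one-hot category (if any)
    print(idx, get_original_column(idx), get_category(idx))

Note that the string splitting assumes neither the original column names nor the category values contain underscores; otherwise you would need to match against the known feature names instead.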

Tracking whether imputation or scaling has been applied to a given feature is not trivial with the current version of sklearn.
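That said, the fitted values you need for questions 3 and 4 can be read directly off the fitted sub-pipelines. This is only a sketch; it assumes the transformer and step names from the setup in the question ('num'/'cat', 'imputer'/'scaler'), so adjust them to your own ColumnTransformer:

# Fitted sub-pipelines, looked up by the names given in the ColumnTransformer.
num_pipe = preprocessor.named_transformers_['num']
cat_pipe = preprocessor.named_transformers_['cat']

# Question 3: the exact value imputed into each column.
print(dict(zip(numeric_features, num_pipe.named_steps['imputer'].statistics_)))
print(dict(zip(categorical_features, cat_pipe.named_steps['imputer'].statistics_)))

# Question 4: the mean and stdev actually used by StandardScaler for each
# numeric column. They are fitted after imputation, which is why they can
# differ from statistics computed on the raw column.
scaler = num_pipe.named_steps['scaler']
print(dict(zip(numeric_features, zip(scaler.mean_, scaler.scale_))))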

  • I am extremely surprised that getting feature names is not a built-in functionality somewhere. Did people really not care about tracking the column transformation process for the past 10 years or so? Nevertheless, I guess this is the best we can do for now. – Bill Huang Nov 19 '19 at 07:30
  • So far, sklearn classes have been designed so that plain numpy arrays work well with them. Now they are building these capabilities for all transformers. [Refer](https://github.com/scikit-learn/scikit-learn/issues/6425). Please accept the answer if you feel it serves your purpose. – Venkatachalam Nov 19 '19 at 08:17
  • eli5 also has an existing implementation to get feature names for column transformers. Maybe [this](https://github.com/scikit-learn/scikit-learn/issues/12525) conversation can help. – Venkatachalam Nov 19 '19 at 08:21
  • I am convinced by the threads you referred to that nobody is likely to outsmart this solution, hence the acceptance. Thank you very much for both the implementation and the explanation of the current meta. Deserves 100 reps. – Bill Huang Nov 19 '19 at 09:29