How do I keep track of the columns of the transformed array produced by sklearn.compose.ColumnTransformer
? By "keeping track of" I mean every bit of information required to perform a inverse transform must be shown explicitly. This includes at least the following:
- What is the source variable of each column in the output array?
- If a column of the output array comes from one-hot encoding of a categorical variable, what is that category?
- What is the exact imputed value for each variable?
- What is the (mean, stdev) used to standardize each numerical variable? (These may differ from direct calculation because of imputed missing values.)
I am using the same approach based on this answer. My input dataset is also a generic pandas.DataFrame
with multiple numerical and categorical columns. Yes, that answer can transform the raw dataset. But I lost track of the columns in the output array. I need these information for peer review, report writing, presentation and further model-building steps. I've been searching for a systematic approach but with no luck.