How to keep track of columns after encoding categorical variables?

Question

I am wondering how I can keep track of the original columns of a dataset once I perform data preprocessing on it?

In the code below, df_columns tells me that column 0 in df_array is A, column 1 is B, and so forth.

However, once I encode the categorical column B, df_columns is no longer valid for keeping track of df_dummies.

import pandas as pd
import numpy as np

animal = ['dog','cat','horse']

df = pd.DataFrame({'A': np.random.rand(9),
                   'B': [animal[np.random.randint(3)] for i in range(9)],
                   'C': np.random.rand(9),
                   'D': np.random.rand(9)})

df_array = df.values
df_columns = df.columns

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('encoder', OneHotEncoder(), [1])], remainder='passthrough')
df_dummies = np.array(ct.fit_transform(df_array), dtype=float)

The solution should be agnostic of the position of the categorical column... be it A, B, C or D. I can do the grunt work and keep updating df_columns by hand... but it wouldn't be elegant or "pythonic".

Furthermore... how would the solution look to keep track of what the categoricals mean? {0,0,1} would be cat, {0,1,0} would be dog and so on?

PS - I am aware of the dummy variable trap and will take df_dummies[:,1:] when I actually use it to train my model.

Nick Kharas · Accepted Answer · 2020-02-13T05:29:57.500

Can you confirm if future data sets will continue to have the same column names? If I got your question correctly, all that you will need to do is save df_columns from the original data frame and use it to reindex your new dataframe.

new_df_reindexed = new_df[df_columns]

To answer your other questions, you can one-hot encode your data using get_dummies() from pandas. Use the drop_first parameter to drop one of the generated column values and avoid the dummy variable trap. Also, save the column list of the one-hot-encoded data frame.
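As a minimal sketch of this step (the toy data is made up for illustration, and train_one_hot_encode_col_list is simply the name used later in this answer), note that the generated column names carry the category labels, which may also cover the question of what each dummy column means:

```python
import pandas as pd

# Toy training data; the values are assumptions for illustration only.
train_df = pd.DataFrame({
    "A": [0.1, 0.2, 0.3, 0.4],
    "B": ["dog", "cat", "horse", "dog"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Encode column B by name, so its position in the frame does not matter.
# drop_first=True drops the first category in sorted order ("cat" here),
# which then acts as the baseline.
train_encoded = pd.get_dummies(train_df, columns=["B"], drop_first=True, dtype=float)

# Save the encoded column list so new data can be aligned to it later.
train_one_hot_encode_col_list = train_encoded.columns.tolist()
print(train_one_hot_encode_col_list)  # ['A', 'C', 'B_dog', 'B_horse']
```

A row with B_dog = 0 and B_horse = 0 is then a cat, since that is the dropped baseline category.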

To ensure that your new / testing / holdout data set has the same column definition as the one used in model training:

1. First, use get_dummies() to one-hot-encode the new data set.
2. Use pandas reindex to bring the new dataframe into the same structure as the one used in model training: df.reindex(columns=train_one_hot_encode_col_list, fill_value=0). Pass either columns=... or axis="columns", not both; pandas raises a TypeError if both are given. This creates dummy variable columns for category values that appear in the training data set but not in the new data set, and fill_value=0 fills them with zeros instead of the default NaN.
3. Finally, make sure no columns remain in the new data set that are absent from the training data set: test_df_reindexed = test_df_onehotencode[train_one_hot_encode_col_list]. The reindex in step 2 already drops such columns, so this selection only matters if that step was skipped, and on its own it raises a KeyError when a training column is missing from the new data.
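The alignment steps above can be sketched as follows. This is an illustration under assumptions, not a definitive recipe: the toy data is made up, and the new data is encoded without drop_first so that the training column list alone decides which dummy columns survive (encoding the new data with drop_first could drop a different category than the one dropped in training).

```python
import pandas as pd

# Toy data; the values and the unseen category "zebra" are assumptions.
train_df = pd.DataFrame({"A": [0.1, 0.2, 0.3], "B": ["dog", "cat", "horse"]})
test_df = pd.DataFrame({"A": [0.5, 0.6], "B": ["cat", "zebra"]})

# Training: encode and remember the resulting column layout.
train_encoded = pd.get_dummies(train_df, columns=["B"], drop_first=True, dtype=float)
train_one_hot_encode_col_list = train_encoded.columns.tolist()

# New data: encode every category that is present, then align to training.
test_df_onehotencode = pd.get_dummies(test_df, columns=["B"], dtype=float)
test_df_reindexed = test_df_onehotencode.reindex(
    columns=train_one_hot_encode_col_list,  # adds missing columns, drops extras
    fill_value=0,                           # missing dummies become 0, not NaN
)

print(test_df_reindexed.columns.tolist())  # ['A', 'B_dog', 'B_horse']
```

Here B_dog and B_horse are added as all-zero columns, while B_cat and B_zebra are dropped because they are not part of the training layout. An unseen category such as "zebra" therefore ends up looking like the baseline, which may or may not be what the model should see.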

If you follow these steps, you can completely rely on the list of original column names, and will not need to track column positions or categorical value definitions.

I would also advise you to read the following for further reference:

One-hot encoding in pandas - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
Column re-indexing - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html

Hey @RealRageDontQuit - I have edited my response to include coding examples, as well as further resources for helpful pandas functions. Does this answer your question? I assumed that you want to save column definitions from model training and then apply them on testing data sets as well as future unseen data. However, let me know if I missed anything. — Nick Kharas, Feb 13 '20 at 05:33
