How can I keep track of the original columns of a dataset once I perform data preprocessing on it?
In the code below, df_columns tells me that column 0 in df_array is A, column 1 is B, and so forth.
However, once I encode the categorical column B, df_columns is no longer valid for keeping track of the columns of df_dummies:
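import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame: three numeric columns (A, C, D) and one categorical column (B)
animal = ['dog', 'cat', 'horse']
df = pd.DataFrame({'A': np.random.rand(9),
                   'B': [animal[np.random.randint(3)] for _ in range(9)],
                   'C': np.random.rand(9),
                   'D': np.random.rand(9)})

# Dropping to a plain array loses the column labels, so keep them separately
df_array = df.values
df_columns = df.columns

# One-hot encode column 1 (B); the remaining columns pass through unchanged
ct = ColumnTransformer([('encoder', OneHotEncoder(), [1])], remainder='passthrough')
# np.float was removed in NumPy 1.24; the builtin float is equivalent here
df_dummies = np.array(ct.fit_transform(df_array), dtype=float)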
The solution should be agnostic of the position of the categorical column, be it A, B, C or D. I could do the grunt work and keep updating df_columns by hand, but that wouldn't be elegant or "pythonic".
Furthermore, how would the solution keep track of what the categoricals mean? For example, {0,0,1} would be cat, {0,1,0} would be dog, and so on.
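For the category lookup, the closest thing I have found is the categories_ attribute of a fitted OneHotEncoder. The sketch below uses a small stand-in array of my own rather than the frame above, and it assumes the default behaviour where categories are sorted, so the actual codes may differ from the ones I guessed in the question.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Stand-in data: a single categorical column
animals = np.array(['dog', 'cat', 'horse', 'cat']).reshape(-1, 1)
enc = OneHotEncoder().fit(animals)

# categories_[0] holds the categories of the first column, in output order
categories = enc.categories_[0]

# Row i of the identity matrix is the one-hot code for categories[i]
identity = np.eye(len(categories), dtype=int)
codes = {str(cat): tuple(int(v) for v in row)
         for cat, row in zip(categories, identity)}
print(codes)

# Going the other way: recover the label from a one-hot row
decoded = enc.inverse_transform(np.array([[0, 0, 1]]))
print(decoded[0][0])
```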
PS: I am aware of the dummy variable trap and will take df_dummies[:,1:] when I actually use it to train my model.