I have a df called X like this:

Index Class Family
1      Mid    12
2      Low     6
3      High    5
4      Low     2

I converted it to dummy variables using the code below:

from sklearn.preprocessing import OneHotEncoder
import pandas as pd
ohe = OneHotEncoder() 
X_object = X.select_dtypes('object')
ohe.fit(X_object)

codes = ohe.transform(X_object).toarray()
feature_names = ohe.get_feature_names(['V1', 'V2'])

# Use X throughout: the frame is called X, and no df is defined at this point
X = pd.concat([X.select_dtypes(exclude='object'),
               pd.DataFrame(codes, columns=feature_names).astype(int)], axis=1)

The resulting df looks like this:

V1_Mid   V1_Low   V1_High V2_12 V2_6 V2_5 V2_2
1          0        0      1     0    0    0

..and so on

Question: How do I convert the resulting df back to the original df?

I have seen this but it gives me NameError: name 'Series' is not defined.

Rahul Agarwal

2 Answers

First, regroup the columns of your resulting df so that the original column names form the first level of a column MultiIndex:

>>> df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('_', n=1).map(tuple))
>>> df = df.rename(columns={'V1': 'Class', 'V2': 'Family'}, level=0)
>>> df
  Class          Family         
    Mid Low High     12  6  5  2
0     1   0    0      1  0  0  0

Now the second level of the columns holds the values. Within each top-level group, we want the name of the column that contains the 1, knowing that all the other entries are 0. This can be done with idxmax():

>>> orig_df = pd.concat({col: df[col].idxmax(axis='columns') for col in df.columns.levels[0]}, axis='columns')
>>> orig_df
  Class Family
0   Mid     12
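
The same idea can be sketched end to end without a MultiIndex. This is a minimal sketch, not the code from the answer above: the helper name decode_one_hot and the hand-built encoded frame are hypothetical, and it assumes every encoded column is named prefix_value with exactly one 1 per prefix group in each row.

```python
import pandas as pd

# Hypothetical one-hot frame shaped like the one in the question.
encoded = pd.DataFrame(
    [[1, 0, 0, 1, 0, 0, 0],
     [0, 1, 0, 0, 1, 0, 0],
     [0, 0, 1, 0, 0, 1, 0],
     [0, 1, 0, 0, 0, 0, 1]],
    columns=["V1_Mid", "V1_Low", "V1_High", "V2_12", "V2_6", "V2_5", "V2_2"],
)


def decode_one_hot(frame, names, sep="_"):
    """Rebuild categorical columns from prefix-named one-hot columns.

    `names` maps each prefix (e.g. "V1") to the original column name.
    Hypothetical helper, written only for illustration.
    """
    # Prefix of every column: the part before the first separator.
    prefixes = frame.columns.str.split(sep, n=1).str[0]
    decoded = {}
    for prefix, new_name in names.items():
        # Columns that belong to this original feature.
        block = frame.loc[:, prefixes == prefix]
        # Label of the column holding the 1 in each row.
        winners = block.idxmax(axis="columns")
        # Drop the prefix so only the category value remains.
        decoded[new_name] = winners.str.split(sep, n=1).str[1]
    return pd.DataFrame(decoded)


restored = decode_one_hot(encoded, {"V1": "Class", "V2": "Family"})
print(restored)
```

Note that the recovered values are strings, because they come from column labels; Family would need .astype(int) if integers are wanted.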
Cimbali

An even simpler way is to stick to pandas.

df = pd.DataFrame({"Index":[1,2,3,4],"Class":["Mid","Low","High","Low"],"Family":[12,6,5,2]})
# Combine features in new column 
df["combined"] = list(zip(df["Class"], df["Family"]))

print(df)

Out:

   Index Class  Family   combined
0      1   Mid      12  (Mid, 12)
1      2   Low       6   (Low, 6)
2      3  High       5  (High, 5)
3      4   Low       2   (Low, 2)

You can get the one-hot encoding using pandas directly.

one_hot = pd.get_dummies(df["combined"])
print(one_hot)

Out:

    (High, 5)  (Low, 2)  (Low, 6)  (Mid, 12)
0          0         0         0          1
1          0         0         1          0
2          1         0         0          0
3          0         1         0          0

Then, to get back, you can take the name of a column and select the row in the original dataframe that has the same tuple.

print(df[df["combined"]==one_hot.columns[0]])

Out:

   Index Class  Family   combined
2      3  High       5  (High, 5)
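
If the goal is to rebuild every row rather than look up one tuple at a time, a pandas-only round trip might look like the sketch below. It swaps in a different technique from the tuple lookup above: it encodes each column separately and assumes pandas 1.5 or later, where pd.from_dummies is available. The names one_hot_cols and restored are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"Index": [1, 2, 3, 4],
                   "Class": ["Mid", "Low", "High", "Low"],
                   "Family": [12, 6, 5, 2]})

# Encode each feature separately; casting to str makes Family categorical too.
# Column names become "Class_Mid", "Family_12", and so on.
one_hot_cols = pd.get_dummies(df[["Class", "Family"]].astype(str),
                              prefix_sep="_", dtype=int)

# from_dummies (assumed available, pandas >= 1.5) splits each column name
# on the separator and returns one column per prefix.
restored = pd.from_dummies(one_hot_cols, sep="_")

# Values come back as strings, so restore the integer dtype explicitly.
restored["Family"] = restored["Family"].astype(int)
print(restored)
```

This relies on each row having exactly one 1 per prefix group, which holds for frames produced by get_dummies.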
Edoardo Guerriero