
From a dataframe with numerical and nominal data:

>>> import pandas as pd
>>> d = {'m': {0: 'M1', 1: 'M2', 2: 'M7', 3: 'M1', 4: 'M2', 5: 'M1'},
         'qj': {0: 'q23', 1: 'q4', 2: 'q9', 3: 'q23', 4: 'q23', 5: 'q9'},
         'Budget': {0: 39, 1: 15, 2: 13, 3: 53, 4: 82, 5: 70}}
>>> df = pd.DataFrame.from_dict(d)
>>> df
   Budget   m   qj
0      39  M1  q23
1      15  M2   q4
2      13  M7   q9
3      53  M1  q23
4      82  M2  q23
5      70  M1   q9

get_dummies converts a categorical variable into dummy/indicator variables:

>>> df_dummies = pd.get_dummies(df)
>>> df_dummies
   Budget  m_M1  m_M2  m_M7  qj_q23  qj_q4  qj_q9
0      39     1     0     0       1      0      0
1      15     0     1     0       0      1      0
2      13     0     0     1       0      0      1
3      53     1     0     0       1      0      0
4      82     0     1     0       1      0      0
5      70     1     0     0       0      0      1

What's the most elegant way to write a back_from_dummies function that gets back from df_dummies to df?

>>> (back_from_dummies(df_dummies) == df).all()
Budget    True
m         True
qj        True
dtype: bool
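(For reference, not part of the original question: pandas 1.5 added a built-in pd.from_dummies that inverts pd.get_dummies directly. It only accepts the indicator columns, so the non-dummy columns have to be split off and re-attached. A sketch, assuming pandas >= 1.5:)

```python
import pandas as pd

d = {'m': {0: 'M1', 1: 'M2', 2: 'M7', 3: 'M1', 4: 'M2', 5: 'M1'},
     'qj': {0: 'q23', 1: 'q4', 2: 'q9', 3: 'q23', 4: 'q23', 5: 'q9'},
     'Budget': {0: 39, 1: 15, 2: 13, 3: 53, 4: 82, 5: 70}}
df = pd.DataFrame.from_dict(d)
df_dummies = pd.get_dummies(df)

# from_dummies only accepts dummy columns, so split them out first
dummy_cols = [c for c in df_dummies.columns if '_' in c]
restored = pd.from_dummies(df_dummies[dummy_cols], sep='_')
restored['Budget'] = df_dummies['Budget']

print((restored[df.columns] == df).all().all())  # True
```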
user3313834

3 Answers


idxmax will do it pretty easily.

from itertools import groupby

import pandas as pd

def back_from_dummies(df):
    result_series = {}

    # Find dummy columns and build (category, column_name) pairs
    dummy_tuples = [(col.split("_")[0], col) for col in df.columns if "_" in col]

    # Find non-dummy columns (those without a "_")
    non_dummy_cols = [col for col in df.columns if "_" not in col]

    # For each category column group, use idxmax to find the value.
    # groupby only groups consecutive keys; get_dummies emits the
    # columns for each category consecutively, so that holds here.
    for dummy, cols in groupby(dummy_tuples, lambda item: item[0]):

        # Select the columns for this category
        dummy_df = df[[col[1] for col in cols]]

        # Column name of the max (the 1) in each row
        max_columns = dummy_df.idxmax(axis=1)

        # Strip the "category_" prefix to recover the value
        result_series[dummy] = max_columns.apply(lambda item: item.split("_", 1)[1])

    # Copy non-dummy columns over.
    for col in non_dummy_cols:
        result_series[col] = df[col]

    # Return a dataframe built from the resulting series
    return pd.DataFrame(result_series)

(back_from_dummies(df_dummies) == df).all()
David Maust

First, separate the columns:

In [11]: import numpy as np
         from collections import defaultdict
         pos = defaultdict(list)
         vals = defaultdict(list)

In [12]: for i, c in enumerate(df_dummies.columns):
             if "_" in c:
                 k, v = c.split("_", 1)
                 pos[k].append(i)
                 vals[k].append(v)
             else:
                 pos["_"].append(i)

In [13]: pos
Out[13]: defaultdict(list, {'_': [0], 'm': [1, 2, 3], 'qj': [4, 5, 6]})

In [14]: vals
Out[14]: defaultdict(list, {'m': ['M1', 'M2', 'M7'], 'qj': ['q23', 'q4', 'q9']})

This allows you to slice into the different frames for each dummied column:

In [15]: df_dummies.iloc[:, pos["m"]]
Out[15]:
   m_M1  m_M2  m_M7
0     1     0     0
1     0     1     0
2     0     0     1
3     1     0     0
4     0     1     0
5     1     0     0

Now we can use numpy's argmax:

In [16]: np.argmax(df_dummies.iloc[:, pos["m"]].values, axis=1)
Out[16]: array([0, 1, 2, 0, 1, 0])

Note: pandas' idxmax returns the label; we want the position so that we can use Categoricals.

In [17]: pd.Categorical.from_codes(np.argmax(df_dummies.iloc[:, pos["m"]].values, axis=1), vals["m"])
Out[17]:
[M1, M2, M7, M1, M2, M1]
Categories (3, object): [M1, M2, M7]
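(The label-vs-position distinction the note refers to can be seen side by side on a small dummy block; this illustration is not part of the original answer:)

```python
import numpy as np
import pandas as pd

dummy_block = pd.DataFrame({'m_M1': [1, 0, 0],
                            'm_M2': [0, 1, 0],
                            'm_M7': [0, 0, 1]})

# idxmax returns the column *label* of the max in each row
print(dummy_block.idxmax(axis=1).tolist())             # ['m_M1', 'm_M2', 'm_M7']

# argmax returns the column *position*, usable as Categorical codes
print(np.argmax(dummy_block.values, axis=1).tolist())  # [0, 1, 2]
```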

Now we can put this all together:

In [21]: df = pd.DataFrame({k: pd.Categorical.from_codes(np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1), vals[k]) for k in vals})

In [22]: df
Out[22]:
    m   qj
0  M1  q23
1  M2   q4
2  M7   q9
3  M1  q23
4  M2  q23
5  M1   q9

and putting back the non-dummied columns:

In [23]: df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]

In [24]: df
Out[24]:
    m   qj  Budget
0  M1  q23      39
1  M2   q4      15
2  M7   q9      13
3  M1  q23      53
4  M2  q23      82
5  M1   q9      70

As a function:

def reverse_dummy(df_dummies):
    pos = defaultdict(list)
    vals = defaultdict(list)

    for i, c in enumerate(df_dummies.columns):
        if "_" in c:
            k, v = c.split("_", 1)
            pos[k].append(i)
            vals[k].append(v)
        else:
            pos["_"].append(i)

    df = pd.DataFrame({k: pd.Categorical.from_codes(
                              np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1),
                              vals[k])
                      for k in vals})

    df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]
    return df

In [31]: reverse_dummy(df_dummies)
Out[31]:
    m   qj  Budget
0  M1  q23      39
1  M2   q4      15
2  M7   q9      13
3  M1  q23      53
4  M2  q23      82
5  M1   q9      70
Andy Hayden

Similar to @David, I find that idxmax will do most of the work for you. I think there is no foolproof way to guarantee that you don't have issues when you are trying to convert columns back, however, because in some cases it can be tricky to identify which columns are dummies and which are not. I find that this can be greatly alleviated by using a separator that is very unlikely to occur in your data by chance. _ is often used in column names that have multiple words, so I use __ (double underscore) as a separator; I've never run into this in a column name in the wild.

Also, note that pd.get_dummies will move all of the dummy columns to the end. This means that you can't necessarily get the original order of the columns back out.
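(A quick illustration of that reordering, using a hypothetical two-column frame:)

```python
import pandas as pd

# 'cat' comes first in the original...
df = pd.DataFrame({'cat': ['a', 'b'], 'x': [1, 2]})

# ...but get_dummies keeps non-dummy columns first and appends the dummies
print(list(pd.get_dummies(df).columns))  # ['x', 'cat_a', 'cat_b']
```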

Here's an example of my approach. You can recognize the dummy columns as those with sep in them. We get the groups of dummy columns using df.filter which will let us match column names using regular expressions (just the part of the name before sep works; there are other ways you could do this part too).

The rename part strips off the beginning of the column names (e.g. m__) so that the remaining part is the value. Then idxmax extracts the column name that has a 1 in it. This gives us the dataframe from undoing pd.get_dummies on one of the original columns; we concatenate together the dataframes from reversing pd.get_dummies on each of the columns, together with other_cols - those columns that weren't "dummified".

In [1]: import pandas as pd

In [2]: df = pd.DataFrame.from_dict({'m': {0: 'M1', 1: 'M2', 2: 'M7', 3: 'M1', 4: 'M2', 5: 'M1'},
   ...:          'qj': {0: 'q23', 1: 'q4', 2: 'q9', 3: 'q23', 4: 'q23', 5: 'q9'},
   ...:          'Budget': {0: 39, 1: 15, 2: 13, 3: 53, 4: 82, 5: 70}})

In [3]: df
Out[3]: 
   Budget   m   qj
0      39  M1  q23
1      15  M2   q4
2      13  M7   q9
3      53  M1  q23
4      82  M2  q23
5      70  M1   q9

In [4]: sep = '__'

In [5]: dummies = pd.get_dummies(df, prefix_sep=sep)

In [6]: dummies
Out[6]: 
   Budget  m__M1  m__M2  m__M7  qj__q23  qj__q4  qj__q9
0      39      1      0      0        1       0       0
1      15      0      1      0        0       1       0
2      13      0      0      1        0       0       1
3      53      1      0      0        1       0       0
4      82      0      1      0        1       0       0
5      70      1      0      0        0       0       1

In [7]: dfs = []
   ...: 
   ...: dummy_cols = list(set(col.split(sep)[0] for col in dummies.columns if sep in col))
   ...: other_cols = [col for col in dummies.columns if sep not in col]
   ...: 
   ...: for col in dummy_cols:
   ...:     dfs.append(dummies.filter(regex=col).rename(columns=lambda name: name.split(sep)[1]).idxmax(axis=1))
   ...: 
   ...: df = pd.concat(dfs + [dummies[other_cols]], axis=1)
   ...: df.columns = dummy_cols + other_cols
   ...: df
   ...: 
Out[7]: 
    qj   m  Budget
0  q23  M1      39
1   q4  M2      15
2   q9  M7      13
3  q23  M1      53
4  q23  M2      82
5   q9  M1      70
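(Since the column order changes, a roundtrip check against the original needs the columns realigned first. Below is a self-contained sketch of the same pipeline; it uses sorted() instead of set order so the column order is deterministic, a small deviation from the answer above:)

```python
import pandas as pd

df = pd.DataFrame({'m': ['M1', 'M2', 'M7'],
                   'qj': ['q23', 'q4', 'q9'],
                   'Budget': [39, 15, 13]})
sep = '__'
dummies = pd.get_dummies(df, prefix_sep=sep)

dummy_cols = sorted(set(c.split(sep)[0] for c in dummies.columns if sep in c))
other_cols = [c for c in dummies.columns if sep not in c]

# Reverse each dummified column via filter + rename + idxmax
parts = [dummies.filter(regex=f'^{c}{sep}')
                .rename(columns=lambda name: name.split(sep)[1])
                .idxmax(axis=1)
         for c in dummy_cols]
restored = pd.concat(parts + [dummies[other_cols]], axis=1)
restored.columns = dummy_cols + other_cols

# Realign columns before comparing with the original
print((restored[df.columns] == df).all().all())  # True
```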
Nathan