`pd.get_dummies` converts a categorical variable into dummy variables. Aside from the fact that it's trivial to reconstruct the categorical variable by hand, is there a preferred/quick way to do it?
7 Answers
It's been a few years, so this may well not have been in the pandas toolkit back when this question was originally asked, but this approach seems a little easier to me. `idxmax` will return the index corresponding to the largest element (i.e. the one with a `1`). We do `axis=1` because we want the column name where the `1` occurs.
EDIT: I didn't bother making it categorical instead of just a string, but you can do that the same way as @Jeff did, by wrapping it with `pd.Categorical` (and `pd.Series`, if desired).
In [1]: import pandas as pd
In [2]: s = pd.Series(['a', 'b', 'a', 'c'])
In [3]: s
Out[3]:
0 a
1 b
2 a
3 c
dtype: object
In [4]: dummies = pd.get_dummies(s)
In [5]: dummies
Out[5]:
a b c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
In [6]: s2 = dummies.idxmax(axis=1)
In [7]: s2
Out[7]:
0 a
1 b
2 a
3 c
dtype: object
In [8]: (s2 == s).all()
Out[8]: True
EDIT in response to @piRSquared's comment:
This solution does indeed assume there's one `1` per row. I think this is usually the format one has. `pd.get_dummies` can return rows that are all 0 if you have `drop_first=True`, or if there are `NaN` values and `dummy_na=False` (the default) (any cases I'm missing?). A row of all zeros will be treated as if it were an instance of the variable named in the first column (e.g. `a` in the example above).
If `drop_first=True`, you have no way to know from the dummies dataframe alone what the name of the "first" variable was, so that operation isn't invertible unless you keep extra information around; I'd recommend leaving `drop_first=False` (the default).
Since `dummy_na=False` is the default, this could certainly cause problems. Please set `dummy_na=True` when you call `pd.get_dummies` if you want to use this solution to invert the "dummification" and your data contains any `NaN`s. Setting `dummy_na=True` will always add a "nan" column, even if that column is all 0s, so you probably don't want to set it unless you actually have `NaN`s. A nice approach might be to set `dummies = pd.get_dummies(series, dummy_na=series.isnull().any())`. What's also nice is that the `idxmax` solution will correctly regenerate your `NaN`s (not just a string that says "nan").
It's also worth mentioning that setting `drop_first=True` and `dummy_na=False` means that `NaN`s become indistinguishable from an instance of the first variable, so this should be strongly discouraged if your dataset may contain any `NaN` values.
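For instance, a minimal sketch of that `NaN` round-trip, continuing the session above (`s_na` is just an illustrative name, not from the original question):
In [9]: s_na = pd.Series(['a', 'b', None, 'c'])
In [10]: dummies_na = pd.get_dummies(s_na, dummy_na=s_na.isnull().any())
In [11]: dummies_na.idxmax(axis=1)
Out[11]:
0 a
1 b
2 NaN
3 c
dtype: object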

This fails where a row is all zeros. It works for this example and under the assumption that one and only one `1` value exists per row. – piRSquared Aug 14 '18 at 13:29
In [46]: s = pd.Series(list('aaabbbccddefgh')).astype('category')
In [47]: s
Out[47]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
In [48]: df = pd.get_dummies(s)
In [49]: df
Out[49]:
a b c d e f g h
0 1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0
5 0 1 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0
7 0 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0 0
9 0 0 0 1 0 0 0 0
10 0 0 0 0 1 0 0 0
11 0 0 0 0 0 1 0 0
12 0 0 0 0 0 0 1 0
13 0 0 0 0 0 0 0 1
In [50]: x = df.stack()
# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
So I think we need a function to 'do' this, as it seems to be a natural operation. Maybe `get_categories()`, see here
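A minimal sketch of what such a helper might look like (the name `get_categories` is hypothetical; it just wraps the stack/`get_level_values` logic above):
def get_categories(dummies):
    # Invert a dummy/indicator DataFrame back into a categorical Series,
    # assuming exactly one 1 per row.
    x = dummies.stack()
    return pd.Series(pd.Categorical(x[x != 0].index.get_level_values(1)))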

This is quite a late answer, but since you ask for a quick way to do it, I assume you're looking for the most performant strategy. On a large dataframe (for instance 10,000 rows), you can get a very significant speed boost by using `np.where` instead of `idxmax` or `get_level_values`, and obtain the same result. The idea is to index the column names where the dummy dataframe is not 0:
Method:
Using the same sample data as @Nathan:
>>> dummies
a b c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
import numpy as np

s2 = pd.Series(dummies.columns[np.where(dummies!=0)[1]])
>>> s2
0 a
1 b
2 a
3 c
dtype: object
Benchmark:
On a small dummy dataframe, you won't see much difference in performance. However, testing different strategies for solving this problem on a large series:
s = pd.Series(np.random.choice(['a','b','c'], 10000))
dummies = pd.get_dummies(s)
def np_method(dummies=dummies):
return pd.Series(dummies.columns[np.where(dummies!=0)[1]])
def idx_max_method(dummies=dummies):
return dummies.idxmax(axis=1)
def get_level_values_method(dummies=dummies):
x = dummies.stack()
return pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
def dot_method(dummies=dummies):
return dummies.dot(dummies.columns)
import timeit
# Time each method, 1000 iterations each:
>>> timeit.timeit(np_method, number=1000)
1.0491090340074152
>>> timeit.timeit(idx_max_method, number=1000)
12.119140846014488
>>> timeit.timeit(get_level_values_method, number=1000)
4.109266621991992
>>> timeit.timeit(dot_method, number=1000)
1.6741622970002936
The `np.where` method is about 4 times faster than the `get_level_values` method and 11.5 times faster than the `idxmax` method! It also beats (but only by a little) the `.dot()` method outlined in this answer to a similar question.
They all return the same result:
>>> (get_level_values_method() == np_method()).all()
True
>>> (idx_max_method() == np_method()).all()
True

Setup
Using @Jeff's setup
s = pd.Series(list('aaabbbccddefgh')).astype('category')
df = pd.get_dummies(s)
If columns are strings and there is only one `1` per row
df.dot(df.columns)
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: object
numpy.where
Again! Assuming only one `1` per row
i, j = np.where(df)
pd.Series(df.columns[j], i)
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: category
Categories (8, object): [a, b, c, d, e, f, g, h]
numpy.where
Not assuming only one `1` per row
i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j])))
0 0 a
1 0 a
2 0 a
3 1 b
4 1 b
5 1 b
6 2 c
7 2 c
8 3 d
9 3 d
10 4 e
11 5 f
12 6 g
13 7 h
dtype: object
numpy.where
Where we don't assume only one `1` per row, and we drop the index
i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j]))).reset_index(-1, drop=True)
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: object

Another option is to use the `from_dummies` function from `pandas`, available since version 1.5.0. Here is a reproducible example:
import pandas as pd
s = pd.Series(['a', 'b', 'a', 'c'])
df = pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
Using `from_dummies`:
pd.from_dummies(df)
0 a
1 b
2 a
3 c
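As a further sketch (assuming pandas >= 1.5.0, where `from_dummies` was added): `from_dummies` also accepts a `default_category` argument, which lets you invert dummies created with `drop_first=True` by specifying what an all-zero row should map back to:
df_dropped = pd.get_dummies(s, drop_first=True)  # only columns 'b' and 'c' remain
pd.from_dummies(df_dropped, default_category='a')
0 a
1 b
2 a
3 c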

Converting `dat["classification"]` to one-hot encoding and back:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# `dat` is assumed to be an existing DataFrame with a "classification" column
le = LabelEncoder()
dat["labels"] = le.fit_transform(dat["classification"])
Y = pd.get_dummies(dat["labels"])

# Recover the encoded labels: the argmax of each dummy row is the label,
# then inverse_transform maps the labels back to the original classes
tru = []
for i in range(0, len(Y)):
    tru.append(np.argmax(Y.iloc[i]))
tru = le.inverse_transform(tru)

## Identical check!
(tru == dat["classification"]).value_counts()
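For reference, a minimal, hypothetical `dat` that makes the snippet above self-contained (the column values are made up for illustration):
dat = pd.DataFrame({"classification": ["cat", "dog", "cat", "bird", "dog"]})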
If you're categorizing the rows in your dataframe based on row-wise mutually exclusive boolean conditions (these are the "dummy" variables) which don't form a partition (i.e. some rows are all 0 because of, for example, missing data), it may be better to initialize a `pd.Categorical` full of `np.nan` and then explicitly set the category of each subset. An example follows.
0. Data setup:
import numpy as np
import pandas as pd

np.random.seed(42)
student_names = list('abcdefghi')
marks = np.random.randint(0, 100, len(student_names)).astype(float)
passes = marks >= 50
marks[[1, 5]] = np.nan # artificially introduce NAs
students = pd.DataFrame({'mark': marks, 'pass': passes}, index=student_names)
>>> students
mark pass
a 51.0 True
b NaN True
c 14.0 False
d 71.0 True
e 60.0 True
f NaN False
g 82.0 True
h 86.0 True
i 74.0 True
1. Compute the value of the relevant boolean conditions:
failed = ~students['pass']
barely_passed = students['pass'] & (students['mark'] < 60)
well_passed = students['pass'] & (students['mark'] >= 60)
>>> pd.DataFrame({'b': barely_passed, 'f': failed, 'p': well_passed}).astype(int)
b f p
a 1 0 0
b 0 0 0
c 0 1 0
d 0 0 1
e 0 0 1
f 0 1 0
g 0 0 1
h 0 0 1
i 0 0 1
As you can see, row `b` has `False` for all three categories (since the mark is `NaN` and `pass` is `True`).
2. Generate the categorical series:
cat = pd.Series(
pd.Categorical([np.nan] * len(students), categories=["failed", "barely passed", "well passed"]),
index=students.index
)
cat[failed] = "failed"
cat[barely_passed] = "barely passed"
cat[well_passed] = "well passed"
>>> cat
a barely passed
b NaN
c failed
d well passed
e well passed
f failed
g well passed
h well passed
i well passed
As you can see, a `NaN` was kept where none of the categories applied.
This approach is as performant as using `np.where`, but allows for the flexibility of possible `NaN`s.
