Create dummies from a column for a subset of data that doesn't contain all the category values in that column

Question

I am handling a subset of a large data set.

There is a column named "type" in the dataframe. The "type" column is expected to have values like [1,2,3,4].

In a certain subset, I find the "type" column only contains some of these values, like [1,4]:

 In [1]: df
 Out[2]:
          type
    0      1
    1      4

When I create dummies from column "type" on that subset, it turns out like this:

In [3]:import pandas as pd
In [4]:pd.get_dummies(df["type"], prefix = "type")
Out[5]:        type_1 type_4
        0        1       0
        1        0       1

It doesn't have the columns named "type_2" and "type_3". What I want is something like:

 Out[6]:        type_1 type_2 type_3 type_4
            0      1      0       0      0
            1      0      0       0      1

Is there a solution for this?

score 3 · Answer 1 · answered Apr 27 '17 at 05:15

What you need to do is make the column 'type' into a pd.Categorical and specify the categories

pd.get_dummies(pd.Categorical(df.type, [1, 2, 3, 4]), prefix='type')

   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1
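
One detail worth checking with this approach: a bare pd.Categorical carries no index, so the dummies come back indexed 0..n-1 rather than with the subset's own row labels. The following is a minimal sketch, assuming a hypothetical subset with index labels 10 and 42 and an assumed full category list `all_types`, that keeps the index by converting the Series itself and then joins the dummies back onto the subset:

```python
import pandas as pd

# Hypothetical subset whose index is not 0..n-1, as is typical after filtering.
subset = pd.DataFrame({"type": [1, 4]}, index=[10, 42])

# Assumed full list of categories for the whole data set.
all_types = [1, 2, 3, 4]

# Converting the Series (rather than building a bare Categorical) keeps the index.
cat_dtype = pd.CategoricalDtype(categories=all_types)
dummies = pd.get_dummies(subset["type"].astype(cat_dtype), prefix="type", dtype=int)

# The shared index lets the dummy columns line up with the original rows.
result = subset.join(dummies)
print(result)
```

The dtype=int argument is only there to get 0/1 values; recent pandas versions return booleans by default.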

jezrael · Accepted Answer · 2017-04-27T05:22:06.577

2

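Another solution with reindex and add_prefix (reindex_axis, used in older pandas, has since been removed):

# dtype=int keeps 0/1 output on pandas >= 2.0, where get_dummies defaults to bool
df1 = (pd.get_dummies(df["type"], dtype=int)
         .reindex(columns=[1,2,3,4], fill_value=0)
         .add_prefix('type'))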
print (df1)
   type1  type2  type3  type4
0      1      0      0      0
1      0      0      0      1

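Or a categorical solution:

# astype('category', categories=...) no longer works in current pandas; pass a CategoricalDtype instead
df1 = pd.get_dummies(df["type"].astype(pd.CategoricalDtype(categories=[1, 2, 3, 4])), prefix='type', dtype=int)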
print (df1)
   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1
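
If several subsets need the same treatment, the reindex idea can be wrapped in a small helper so every subset ends up with identical columns. This is only a sketch: the function name `dummies_with_all_levels` and the two sample Series are made up for illustration. Note that reindex also silently drops any value that is not in `levels`, so the list is assumed to be complete.

```python
import pandas as pd

def dummies_with_all_levels(values, levels, prefix):
    """One-hot encode `values`, always returning one column per entry in `levels`."""
    dummies = pd.get_dummies(values, dtype=int)
    # reindex adds the missing level columns (filled with 0) and fixes the column order.
    dummies = dummies.reindex(columns=levels, fill_value=0)
    return dummies.add_prefix(f"{prefix}_")

levels = [1, 2, 3, 4]                        # assumed full set of categories
part_a = pd.Series([1, 4], name="type")      # hypothetical subset A
part_b = pd.Series([2, 2, 3], name="type")   # hypothetical subset B

encoded_a = dummies_with_all_levels(part_a, levels, "type")
encoded_b = dummies_with_all_levels(part_b, levels, "type")
print(encoded_a)
print(encoded_b)
```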

edited Apr 27 '17 at 05:22

answered Apr 27 '17 at 05:16

jezrael

Glad I could help. Have a nice day! – jezrael Apr 27 '17 at 05:46

score 2 · Answer 3 · answered Apr 27 '17 at 05:35

Since you tagged your post as one-hot-encoding, you may find scikit-learn's OneHotEncoder useful, in addition to the pure pandas solutions:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# sample data
df = pd.DataFrame({'type':[1,4]})
n_vals = 5

# one-hot encoding
# categories replaces the removed n_values argument; sparse_output needs
# scikit-learn >= 1.2 (older releases spell it sparse=False)
encoder = OneHotEncoder(categories=[list(range(n_vals))], sparse_output=False, dtype=int)
data = encoder.fit_transform(df.type.values.reshape(-1,1))

# encoded data frame
newdf = pd.DataFrame(data, columns=['type_{}'.format(x) for x in range(n_vals)])

print(newdf)

   type_0  type_1  type_2  type_3  type_4
0       0       1       0       0       0
1       0       0       0       0       1

One advantage of this approach is that OneHotEncoder easily produces sparse matrices, which helps with very large class sets. (Just set sparse_output=True, or sparse=True on scikit-learn versions before 1.2, in the OneHotEncoder() declaration.)
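
As a rough sketch of the sparse variant: the encoder below lists the expected categories explicitly as [1, 2, 3, 4] (an assumption taken from the question, so there is no extra type_0 column) and relies on the encoder's default sparse output, which avoids the sparse/sparse_output naming difference between scikit-learn versions. The column names are built by hand for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample subset that only contains types 1 and 4.
df = pd.DataFrame({"type": [1, 4]})

# Assumed full set of categories, one inner list per input column.
all_types = [1, 2, 3, 4]
encoder = OneHotEncoder(categories=[all_types], dtype=int)

# The default output is a SciPy sparse matrix with one column per category.
X = encoder.fit_transform(df[["type"]])

# Column labels built from the categories the encoder was given.
columns = ["type_{}".format(c) for c in encoder.categories_[0]]

# Densify only for display; large data would normally stay sparse.
print(pd.DataFrame(X.toarray(), columns=columns))
```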
