I am working with a subset of a large data set.

There is a column named "type" in the dataframe. The "type" column is expected to contain the values [1,2,3,4].

In a certain subset, the "type" column contains only some of those values, such as [1,4]:

 In [1]: df
 Out[2]:
          type
    0      1
    1      4

When I create dummies from column "type" on that subset, it turns out like this:

In [3]: import pandas as pd
In [4]: pd.get_dummies(df["type"], prefix="type")
Out[5]:
   type_1  type_4
0       1       0
1       0       1

It doesn't have the columns "type_2" and "type_3". What I want is:

 Out[6]:        type_1 type_2 type_3 type_4
            0      1      0       0      0
            1      0      0       0      1

Is there a solution for this?

jessie tio

3 Answers

What you need to do is turn the 'type' column into a pd.Categorical and specify the categories:

pd.get_dummies(pd.Categorical(df.type, [1, 2, 3, 4]), prefix='type')

   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1
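
For completeness, here is a self-contained sketch of the same idea. The sample frame and the all_types list are taken from the question; passing dtype=int is my own addition, on the assumption that you want 0/1 output even on pandas versions that return booleans by default.

```python
import pandas as pd

# Subset in which only types 1 and 4 occur (sample data from the question).
df = pd.DataFrame({"type": [1, 4]})

# Full list of expected values, as stated in the question.
all_types = [1, 2, 3, 4]

# Declaring the categories up front makes get_dummies emit one column per
# category, including those that never appear in this subset.
cat = pd.Categorical(df["type"], categories=all_types)

# dtype=int keeps 0/1 output on pandas versions that default to booleans.
dummies = pd.get_dummies(cat, prefix="type", dtype=int)

print(dummies)
```

Note that a value outside the declared categories becomes NaN in the Categorical, so its row comes out as all zeros rather than raising an error.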
piRSquared

Another solution uses reindex and add_prefix (reindex_axis, found in older pandas code, was deprecated in 0.21 and removed in 1.0):

df1 = (pd.get_dummies(df["type"])
         .reindex(columns=[1, 2, 3, 4], fill_value=0)
         .add_prefix('type'))
print (df1)
   type1  type2  type3  type4
0      1      0      0      0
1      0      0      0      1
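
If you want the underscore-separated names from the question, a variant along these lines should work. It is only a sketch: the all_types name and the "type_" prefix are my choices, and dtype=int is there to force 0/1 output on pandas versions that default to booleans.

```python
import pandas as pd

# Sample subset from the question.
df = pd.DataFrame({"type": [1, 4]})
all_types = [1, 2, 3, 4]

dummies = (
    pd.get_dummies(df["type"], dtype=int)      # columns are the raw values 1 and 4
    .reindex(columns=all_types, fill_value=0)  # add 2 and 3 as all-zero columns
    .add_prefix("type_")                       # rename to type_1 ... type_4
)

print(dummies)
```

One thing to keep in mind with reindex: any dummy column whose value is not in all_types is dropped silently, so unexpected values in the data will not show up in the result.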

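Or a categorical solution (astype('category', categories=...) was removed in pandas 0.25; pass a CategoricalDtype instead):

df1 = pd.get_dummies(df["type"].astype(pd.CategoricalDtype(categories=[1, 2, 3, 4])), prefix='type')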
print (df1)
   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1
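
Since the question is about subsets of a larger data set, it may help to define the dtype once and reuse it, so every subset gets the same columns. A minimal sketch follows; encode_type, subset_a and subset_b are hypothetical names used for illustration only.

```python
import pandas as pd

# One shared dtype that lists every expected value.
type_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4])


def encode_type(subset: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode the 'type' column with a fixed set of columns."""
    typed = subset["type"].astype(type_dtype)
    # dtype=int keeps 0/1 output on pandas versions that default to booleans.
    return pd.get_dummies(typed, prefix="type", dtype=int)


# Two hypothetical subsets that each contain only some of the values.
subset_a = pd.DataFrame({"type": [1, 4]})
subset_b = pd.DataFrame({"type": [2, 3, 3]})

encoded_a = encode_type(subset_a)
encoded_b = encode_type(subset_b)

# Both results have identical columns, so they can be stacked directly.
combined = pd.concat([encoded_a, encoded_b], ignore_index=True)
print(combined)
```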
jezrael

Since you tagged your post as one-hot-encoding, you may find scikit-learn's OneHotEncoder useful, in addition to the pure pandas solutions:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# sample data
df = pd.DataFrame({'type':[1,4]})
n_vals = 5

# one-hot encoding
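# n_values was removed in scikit-learn 0.22; pass the full category list instead.
# sparse_output was named sparse before scikit-learn 1.2.
encoder = OneHotEncoder(categories=[list(range(n_vals))], sparse_output=False, dtype=int)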
data = encoder.fit_transform(df.type.values.reshape(-1,1))

# encoded data frame
newdf = pd.DataFrame(data, columns=['type_{}'.format(x) for x in range(n_vals)])

print(newdf)

   type_0  type_1  type_2  type_3  type_4
0       0       1       0       0       0
1       0       0       0       0       1

One advantage of this approach is that OneHotEncoder can readily produce sparse output, which helps with very large sets of classes. Just set sparse=True in the OneHotEncoder() call (the argument is named sparse_output from scikit-learn 1.2 onward).
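
As a rough illustration of the sparse case, the sketch below relies on sparse output being OneHotEncoder's default, so it avoids the sparse/sparse_output argument entirely. The all_types list (1 to 4, as in the question) is my assumption here and differs from the n_vals = 5 used above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample subset from the question.
df = pd.DataFrame({"type": [1, 4]})
all_types = [1, 2, 3, 4]

# Sparse output is the default, so no sparse/sparse_output argument is given.
# categories takes one list of expected values per input column.
encoder = OneHotEncoder(categories=[all_types], dtype=int)

# The result is a SciPy sparse matrix with one column per expected value.
encoded = encoder.fit_transform(df[["type"]])

print(encoded.shape)
print(encoded.toarray())  # densify only for display
```

Only the non-zero entries are stored, so memory use grows with the number of rows rather than rows times classes.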

andrew_reece