I have a categorical variable whose levels are known in advance (e.g. hour, which only takes values from 0 to 23), but not all of them appear in the data yet: say we have measurements for hours 0 through 11, while hours 12 through 23 are not covered and will be added later. If we naively use pandas.get_dummies() to map the values to indicator variables, we end up with only 12 columns instead of 24. Is there a way to map the values of a categorical variable to a predefined list of dummy variables?

Here's an example of expected behaviour:

possible_values = range(24)
hours = get_dummies_on_steroids(df['hour'], prefix='hour', levels=possible_values)
ffriend

1 Answer

Using the new and improved Categorical type in pandas 0.15:

import pandas as pd
import numpy as np
df = pd.DataFrame({'hour': [0, 1, 3, 8, 13, 14], 'val': np.random.randn(6)})
df
Out[4]: 
   hour       val
0     0 -0.098287
1     1 -0.682777
2     3  1.000749
3     8 -0.558877
4    13  1.423675
5    14  1.461552

df['hour_cat'] = pd.Categorical(df['hour'], categories=range(24))
pd.get_dummies(df['hour_cat'])
Out[6]: 
   0   1   2   3   4   5   6   7   8   9  ...  
0   1   0   0   0   0   0   0   0   0   0 ...      
1   0   1   0   0   0   0   0   0   0   0 ...   
2   0   0   0   1   0   0   0   0   0   0 ...   
3   0   0   0   0   0   0   0   0   1   0 ...   
4   0   0   0   0   0   0   0   0   0   0 ...   
5   0   0   0   0   0   0   0   0   0   0 ...

The situation you describe, where you know your data can take a specific set of values, but you haven't necessarily observed all of them, is exactly what Categorical is good for.
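
To also get the prefixed column names from the question, the same idea can be wrapped in a small helper. This is only a minimal sketch: dummies_with_levels is a hypothetical name (not a pandas function), and it assumes a pandas version that provides pd.CategoricalDtype, which is newer than the 0.15 release mentioned above.

```python
import pandas as pd


def dummies_with_levels(values, levels, prefix=None):
    """Hypothetical helper: one indicator column per entry in `levels`,
    whether or not that level occurs in `values`."""
    # Declaring the full set of categories up front is what makes
    # get_dummies emit columns for levels that are not observed yet.
    dtype = pd.CategoricalDtype(categories=list(levels))
    as_cat = pd.Series(values).astype(dtype)
    return pd.get_dummies(as_cat, prefix=prefix)


df = pd.DataFrame({'hour': [0, 1, 3, 8, 13, 14]})
hours = dummies_with_levels(df['hour'], levels=range(24), prefix='hour')
print(hours.shape)          # 6 rows, 24 indicator columns
print(list(hours.columns[:3]))
```

Note that a value outside the declared levels becomes NaN when cast to the categorical dtype, so its row comes out as all zeros rather than raising an error.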

Marius
  • Speed of answers is what always delights me on StackOverflow. Thanks, it works perfectly well! – ffriend Nov 03 '14 at 23:19
  • FYI, I think there might be a small issue in that ``pd.get_dummies`` is returning float dtypes here: https://github.com/pydata/pandas/issues/8725 – Jeff Nov 03 '14 at 23:36
  • @Jeff: this is pretty unexpected behaviour, so thanks for noting! – ffriend Nov 05 '14 at 22:26
  • well categorical is a new type and it has some edge cases - this will be cleaned up in 0.15.2 (0.15.1 releasing in a few days) – Jeff Nov 06 '14 at 00:25
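
On the dtype point raised in the comments: if the column type of the indicators matters, it can be requested explicitly. The sketch below assumes a pandas version recent enough that get_dummies accepts a dtype argument, which is not the case for the 0.15 releases discussed here.

```python
import pandas as pd

# Only three hours observed, but all 24 levels are declared.
hour_cat = pd.Series(pd.Categorical([0, 1, 3], categories=range(24)))

# Ask for integer indicator columns instead of relying on the default.
dummies = pd.get_dummies(hour_cat, prefix='hour', dtype=int)
print(dummies.dtypes.unique())
```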