1-hot encoding on test data in Python

Asked Jun 04 '16 at 22:21

Active Jun 04 '16 at 22:31

Viewed 49 times

I am working on a classification problem. My training data has some categorical variables that I want to convert to dummy variables. This I can easily do with Pandas.

The problem is: what if the test data has some levels that are not present in the train data? How can I convert the test data to one-hot encoded data that has the same schema as the train data?

For example:

train data

id attribute  class
-------------------
1   'a'       'good'
2   'b'       'bad' 
3   'c'       'good'
4   'd'       'bad'

1-hot encoded train data

id  dummy_attr_a  dummy_attr_b  dummy_attr_c  dummy_attr_d  class
-----------------------------------------------------------------
1        1              0            0             0        'good'      
2        0              1            0             0        'bad'
3        0              0            1             0        'good'
4        0              0            0             1        'bad'

test data

id attribute  class
-------------------
1   'a'       'good'
2   'e'       'bad'

The problem is that I cannot convert this into dummy variables directly, as this would produce only two dummy columns, dummy_attr_a and dummy_attr_e, and dummy_attr_e is not present in the train data (while dummy_attr_b, dummy_attr_c and dummy_attr_d would be missing).
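To make the mismatch concrete, here is a minimal sketch that reproduces it with pandas. It assumes the dummies are built with `pd.get_dummies` (the question only says "Pandas"), and the frame names `train` and `test` are just illustrative:

```python
import pandas as pd

# Toy frames mirroring the tables above.
train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "attribute": ["a", "b", "c", "d"],
    "class": ["good", "bad", "good", "bad"],
})
test = pd.DataFrame({
    "id": [1, 2],
    "attribute": ["a", "e"],
    "class": ["good", "bad"],
})

# Encoding each frame independently (assumed approach, not stated in the question).
train_dummies = pd.get_dummies(train, columns=["attribute"], prefix="dummy_attr")
test_dummies = pd.get_dummies(test, columns=["attribute"], prefix="dummy_attr")

train_cols = list(train_dummies.columns)
test_cols = list(test_dummies.columns)

# The two schemas differ: the test frame gains dummy_attr_e and
# lacks dummy_attr_b, dummy_attr_c and dummy_attr_d.
print(train_cols)
print(test_cols)
```

Each call only sees the levels in its own frame, so the column sets diverge.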

edited Jun 04 '16 at 22:31


Sonu Mishra


Thanks. I will take a look. – Sonu Mishra Jun 05 '16 at 06:24

Specifically take a look at http://stackoverflow.com/a/37451867/2285236 Working with categories seems to solve the issue. – ayhan Jun 05 '16 at 06:25
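One possible reading of the "working with categories" suggestion, sketched below, is to fix the set of levels from the train data and cast both frames to that categorical type before creating dummies. This is not necessarily what the linked answer does. The helper name `encode` and the choice to turn unseen levels such as 'e' into an all-zero row are assumptions made for illustration:

```python
import pandas as pd

train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "attribute": ["a", "b", "c", "d"],
    "class": ["good", "bad", "good", "bad"],
})
test = pd.DataFrame({
    "id": [1, 2],
    "attribute": ["a", "e"],
    "class": ["good", "bad"],
})

# Levels are taken from the train data only.
levels = sorted(train["attribute"].unique())
cat_type = pd.CategoricalDtype(categories=levels)

def encode(df):
    """Hypothetical helper: one-hot encode 'attribute' using the train levels."""
    out = df.copy()
    # Assumption: levels unseen in train are treated as missing,
    # so they end up as a row of zeros across all dummy columns.
    known = out["attribute"].where(out["attribute"].isin(levels))
    out["attribute"] = known.astype(cat_type)
    # get_dummies emits one column per category, even for unused categories.
    return pd.get_dummies(out, columns=["attribute"], prefix="dummy_attr", dtype=int)

train_encoded = encode(train)
test_encoded = encode(test)
print(test_encoded)
```

With this, both encoded frames share the same columns in the same order. Whether an all-zero row is the right treatment for an unseen level depends on the model, so treat that part as a design choice rather than a rule.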


0 Answers