I am working on a classification problem. My training data has some categorical variables that I want to convert to dummy variables, which I can easily do with pandas.

The problem is: what if the test data has some levels that are not present in the train data? How can I convert the test data to one-hot encoded data that has the same schema as the train data?

For example:

train data

id attribute  class
-------------------
1   'a'       'good'
2   'b'       'bad' 
3   'c'       'good'
4   'd'       'bad'

1-hot encoded train data

id  dummy_attr_a  dummy_attr_b  dummy_attr_c  dummy_attr_d  class
-----------------------------------------------------------------
1        1              0            0             0        'good'      
2        0              1            0             0        'bad'
3        0              0            1             0        'good'
4        0              0            0             1        'bad'
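
For reference, here is a minimal sketch of how the encoded train table above could be produced with `pandas.get_dummies`. The `dummy_attr` prefix and the column reordering are assumptions chosen to match the layout shown above; the quotes around the values are dropped.

```python
import pandas as pd

# Train data from the example above.
train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "attribute": ["a", "b", "c", "d"],
    "class": ["good", "bad", "good", "bad"],
})

# One-hot encode only the 'attribute' column; dtype=int gives 0/1 instead of booleans.
train_encoded = pd.get_dummies(
    train, columns=["attribute"], prefix="dummy_attr", dtype=int
)

# get_dummies appends the new columns at the end, so move 'class' last
# to match the table layout above.
ordered = [c for c in train_encoded.columns if c != "class"] + ["class"]
train_encoded = train_encoded[ordered]

print(train_encoded)
```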

test data

id attribute  class
-------------------
1   'a'       'good'
2   'e'       'bad'

The problem is that I cannot convert this into dummy variables directly, as doing so would produce only two dummy columns, dummy_attr_a and dummy_attr_e, and dummy_attr_e is not present in the train data.
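
One possible approach, sketched here under assumptions rather than offered as a definitive answer, is to encode the test data separately and then reindex its columns against the encoded train columns. Dummy columns missing from the test data are filled with 0, and columns for unseen levels such as `e` are dropped, so an unseen level ends up as an all-zero row. Whether that is the right treatment of unseen levels depends on the model; the `dummy_attr` prefix is chosen to match the example above.

```python
import pandas as pd

# Train and test data from the example above.
train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "attribute": ["a", "b", "c", "d"],
    "class": ["good", "bad", "good", "bad"],
})
test = pd.DataFrame({
    "id": [1, 2],
    "attribute": ["a", "e"],
    "class": ["good", "bad"],
})

train_encoded = pd.get_dummies(
    train, columns=["attribute"], prefix="dummy_attr", dtype=int
)
test_encoded = pd.get_dummies(
    test, columns=["attribute"], prefix="dummy_attr", dtype=int
)

# Force the test frame onto the train schema:
# - columns present in train but not in test (dummy_attr_b/c/d) are added as 0
# - columns present only in test (dummy_attr_e) are dropped
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(test_aligned)
```

With this sketch, the test row with attribute `e` has 0 in every dummy column, which makes it distinguishable from all levels seen in training.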

Sonu Mishra
