I am working on a classification problem. My training data has some categorical variables that I want to convert to dummy variables, which I can easily do with Pandas.
The problem is: what if the test data has some levels that are not present in the training data? How can I convert the test data to one-hot encoded data that has the same schema as the training data?
For example:
train data
id  attribute  class
--------------------
1   'a'        'good'
2   'b'        'bad'
3   'c'        'good'
4   'd'        'bad'
1-hot encoded train data
id  dummy_attr_a  dummy_attr_b  dummy_attr_c  dummy_attr_d  class
------------------------------------------------------------------
1   1             0             0             0             'good'
2   0             1             0             0             'bad'
3   0             0             1             0             'good'
4   0             0             0             1             'bad'
test data
id  attribute  class
--------------------
1   'a'        'good'
2   'e'        'bad'
The problem is that I cannot convert this into dummy variables directly, as this would produce only two dummy columns: dummy_attr_a and dummy_attr_e (the latter is not present in the train data).
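To make the mismatch concrete, here is a minimal reproduction, assuming the dummies are built with `pd.get_dummies` and a `dummy_attr` prefix (the frame and variable names are only illustrative). Encoding each frame on its own gives two different sets of columns:

```python
import pandas as pd

# Toy frames matching the tables above.
train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "attribute": ["a", "b", "c", "d"],
    "class": ["good", "bad", "good", "bad"],
})
test = pd.DataFrame({
    "id": [1, 2],
    "attribute": ["a", "e"],
    "class": ["good", "bad"],
})

# Encode each frame independently; get_dummies only sees the levels
# that occur in the frame it is given.
train_encoded = pd.get_dummies(train, columns=["attribute"], prefix="dummy_attr")
test_encoded = pd.get_dummies(test, columns=["attribute"], prefix="dummy_attr")

train_cols = list(train_encoded.columns)
test_cols = list(test_encoded.columns)

print(train_cols)
print(test_cols)

# Columns that exist in one encoding but not the other.
print(sorted(set(train_cols) - set(test_cols)))
print(sorted(set(test_cols) - set(train_cols)))
```

The test encoding is missing dummy_attr_b, dummy_attr_c and dummy_attr_d, and gains dummy_attr_e, so the two frames cannot be fed to the same model as they are.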