deploy machine learning model with one hot encoded features

Question

I have trained an xgboost classifier with categorical features that I have previously one hot encoded. For example, I have a categorical feature 'Year' which takes values between 2014 and 2018. When OHEd I get 5 binary features: Year_2014, Year_2015, Year_2016, Year_2017, Year_2018. What happens if I make a prediction on a sample that has Year=2019 since the feature Year_2019 does not exist?

More generally, what is a robust way to transform data in order to make predictions on new samples?

Why don't you actually *try* it, and report here any issues you might have? Questions like your "more generally" part are arguably off-topic for SO, which is about *practical coding* issues... — desertnaut, Mar 07 '19 at 18:05
The prediction function will fail. On the second part of the question, there is no straightforward answer, but you'll find good discussions on SO and other SE sites. Here's one: https://stackoverflow.com/questions/51505295/how-to-handle-one-hot-encoding-in-production-environment-when-number-of-features. — Supratim Haldar, Mar 07 '19 at 20:03

score 0 · Answer 1 · answered Mar 07 '19 at 20:20

Binary features are evaluated like this:

if(year != ${year value}){
  // Enter "left" branch
} else {
  // Enter "right" branch
}

An unseen category level gets sent to the "left" branch.
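
To make the claim above concrete, here is a minimal pure-Python sketch. It is an illustration only, not XGBoost's internal code. It assumes the new sample is encoded against the training levels only, so an unseen year produces zeros in every indicator; the names `KNOWN_YEARS`, `one_hot` and `branch` are hypothetical.

```python
# Levels seen during training (taken from the question).
KNOWN_YEARS = [2014, 2015, 2016, 2017, 2018]

def one_hot(year):
    """Encode a year against the training levels only."""
    return {f"Year_{y}": int(year == y) for y in KNOWN_YEARS}

def branch(sample, feature):
    """Mimic one split on a binary feature: 0 goes left, 1 goes right."""
    return "right" if sample[feature] == 1 else "left"

seen = one_hot(2018)
unseen = one_hot(2019)  # every indicator is 0

# Evaluate a split on each indicator in turn.
seen_path = [branch(seen, f) for f in seen]
unseen_path = [branch(unseen, f) for f in unseen]

print(seen_path)
print(unseen_path)
```

Under this assumption the 2019 sample takes the "left" branch at every `Year_*` split, so the model treats it as "none of the known years" rather than failing.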

user1808924


score 0 · Answer 2 · answered Feb 18 '21 at 06:17

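import pandas as pd

# During training, say year takes the values below
df = pd.DataFrame([2014, 2015, 2016, 2017, 2018], columns=['year'])
data = pd.get_dummies(df, columns=['year'])
data.head()

# At prediction time, say the input for year is 2018
# (use the same dtype as in training, otherwise no category will match)
known_categories = [2014, 2015, 2016, 2017, 2018]
year_type = pd.Series([2018])
# Fixing the categories yields one column per training level;
# a value outside known_categories (e.g. 2019) becomes NaN and encodes as all zeros
year_type = pd.Categorical(year_type, categories=known_categories)
pd.get_dummies(year_type)
# If the model receives a plain array, the column names do not matter;
# only the values and their order are passed to the model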
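
An alternative sketch of the same idea, assuming the training-time dummy columns are saved and reused at prediction time: encode the new data, then align it to the saved columns with `DataFrame.reindex`. The helper name `encode_for_prediction` is hypothetical, and this is one possible approach rather than the only robust one.

```python
import pandas as pd

# Training time: record the dummy columns the model was fitted on.
train = pd.DataFrame({"year": [2014, 2015, 2016, 2017, 2018]})
train_columns = pd.get_dummies(train, columns=["year"], dtype=int).columns

def encode_for_prediction(df):
    """One-hot encode df and align it to the training columns."""
    dummies = pd.get_dummies(df, columns=["year"], dtype=int)
    # Training columns missing from df are added and filled with 0;
    # columns for unseen categories (e.g. year_2019) are dropped.
    return dummies.reindex(columns=train_columns, fill_value=0)

# Prediction time: one known year and one unseen year.
new = pd.DataFrame({"year": [2018, 2019]})
encoded = encode_for_prediction(new)
print(encoded)
```

The unseen year ends up as a row of zeros with exactly the columns the model expects, so the prediction call receives input of the right shape.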

Sanjay Nitwal


The community encourages adding explanations alongside code, rather than purely code-based answers (see [here](https://meta.stackoverflow.com/questions/300837/what-comment-should-i-add-to-code-only-answers)) – Ahmet Feb 18 '21 at 10:15
