6

When running experiments, we usually split the data, say training on 70% and testing on the remaining 30%. But what happens when your model is in production? The following may occur:

Training Set:

-----------------------
| Ser |Type Of Car    |
-----------------------
|  1  | Hatchback     |
|  2  | Sedan         |
|  3  | Coupe         |
|  4  | SUV           |
-----------------------

After one-hot encoding this, this is what we get:

-----------------------------------------
| Ser | Hatchback | Sedan | Coupe | SUV |
-----------------------------------------
|  1  |     1     |   0   |   0    |  0 |
|  2  |     0     |   1   |   0    |  0 |
|  3  |     0     |   0   |   1    |  0 |
|  4  |     0     |   0   |   0    |  1 |
-----------------------------------------

My model is trained and now I want to deploy it across multiple dealerships. The model is trained on 4 features. Now, a certain dealership only sells Sedans and Coupes:

Test Set :

-----------------------
| Ser |Type Of Car    |
-----------------------
|  1  | Coupe         |
|  2  | Sedan         |
-----------------------

One-hot encoding results in:

---------------------------
| Ser | Coupe     | Sedan |
---------------------------
|  1  |     1     |   0   |
|  2  |     0     |   1   |
---------------------------

Here our test set has only 2 features. It does not make sense to build a model for every new dealership. How to handle such problems in production? Is there any other encoding method that can be used to handle Categorical variables?
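A minimal sketch reproducing the mismatch (assuming pandas, with the car types from the tables above):

```python
import pandas as pd

# Car types seen during training vs. at one dealership in production
train = pd.Series(['Hatchback', 'Sedan', 'Coupe', 'SUV'])
test = pd.Series(['Coupe', 'Sedan'])

# get_dummies infers columns from the data it sees, so the shapes differ
print(pd.get_dummies(train).shape)  # 4 dummy columns
print(pd.get_dummies(test).shape)   # only 2 dummy columns
```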

  • Hmm, you would hope you don't get too many new cases. If you do, your training sample might not be representative. Worst case, if you get a new type of car, just let all entries in the type columns be 0. – Demetri Pananos Jul 24 '18 at 18:27
  • I do not think this will work in a production environment because the number of anomalies would vary, and plugging in zero values does no good; it worsens the model. There are other methods such as contrast coding, Helmert coding, etc. Have you used them? – Roshan Joe Vincent Jul 24 '18 at 18:35
  • The point is, if you haven't trained on those values, you have no idea what to do with them. They might as well be missing. The default thing most libraries like pandas do is zero out the entries in the one-hot encoded columns (see my answer). If you have another strategy, why even post here? – Demetri Pananos Jul 24 '18 at 18:36
  • I wanted to know how people do it. Seems like everyone just adds Zeros. – Roshan Joe Vincent Jul 24 '18 at 18:38
  • This is a common scenario, isn't it? I was expecting this question to flood with answers. Amused to see only two. – Pramesh Bajracharya May 20 '19 at 09:48

2 Answers

6

I'll assume you are using pandas to do the one hot encoding. If not, you have to do some more work, but the logic is still the same.

import pandas as pd

known_categories = ['Sedan','Coupe','Limo'] # from training set

car_type = pd.Series(['Sedan','Ferrari']) # new category in production, 'Ferrari'

car_type = pd.Categorical(car_type, categories = known_categories)

pd.get_dummies(car_type)

Result is

    Sedan   Coupe   Limo
0   1.0      0.0    0.0    # Sedan entry
1   0.0      0.0    0.0    # Ferrari entry

Since Ferrari is not in the list of known categories, all the one-hot encoded entries for the Ferrari are zero. If you find a new car type in your production data, the rows encoding the car type should all be 0.
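If you are using scikit-learn rather than pandas, its `OneHotEncoder` has the same behaviour built in via `handle_unknown='ignore'` (a sketch, not part of the original answer; category values are the same as above):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the categories seen during training
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(np.array([['Sedan'], ['Coupe'], ['Limo']]))

# 'Ferrari' was never seen during fitting, so with handle_unknown='ignore'
# its row is encoded as all zeros instead of raising an error
out = enc.transform(np.array([['Sedan'], ['Ferrari']])).toarray()
```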

desertnaut
Demetri Pananos
  • What if there is a entry in the test set that is not there in the train? – Roshan Joe Vincent Jul 24 '18 at 18:36
  • That is what this example shows. If the `known_categories` were all the unique car types from your training set, and `car_type` are the car types from your test set, this is what would happen (according to pandas) – Demetri Pananos Jul 24 '18 at 18:37
0

The input to your model in production should be the same as during training. So if during training you one-hot encoded 4 categories, do the same in production: use zeros for features missing from the new data, and drop features you have not seen during training.
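A sketch of this alignment with pandas (the column list is illustrative, taken from the dealership example in the question):

```python
import pandas as pd

train_columns = ['Hatchback', 'Sedan', 'Coupe', 'SUV']  # saved at training time

test = pd.Series(['Coupe', 'Sedan'])
dummies = pd.get_dummies(test)  # only produces Coupe and Sedan columns

# Align to the training columns: columns missing from the test data are
# filled with 0, and any columns not seen during training would be dropped
aligned = dummies.reindex(columns=train_columns, fill_value=0)
```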

Andrew