I have a dataset that looks like this:

| Amount   | Source | y |
| -------- | ------ | - |
| 285      | a      | 1 |
| 556      | b      | 0 | 
| 883      | c      | 0 |
| 156      | c      | 1 |
| 374      | a      | 1 |
| 1520     | d      | 0 |

'Source' is the categorical variable. Its categories are 'a', 'b', 'c' and 'd', so the one-hot encoded columns are 'source_a', 'source_b', 'source_c' and 'source_d'. I am training a model on this data to predict y. The new data I want to predict on does not contain every category used in training: it only has 'a', 'c' and 'd'. When I one-hot encode this new dataset, the column 'source_b' is missing. How do I transform this data so that it has the same columns as the training data?

PS: I am using XGBClassifier() for prediction.

Sudhakar Samak

2 Answers

2

Use the same encoder instance for training and inference. Assuming you use sklearn's OneHotEncoder, fit it once on the training data and save the fitted encoder as a pickle so you can reuse it at inference time.

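from sklearn.preprocessing import OneHotEncoder
import pickle

# handle_unknown='ignore' encodes categories not seen during fit as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
# assume X_train is the 2-D 'Source' column, e.g. df[['Source']]
X_train = enc.fit_transform(X_train)
# save the fitted encoder; the with-block closes the file afterwards
with open('onehot.pickle', 'wb') as f:
    pickle.dump(enc, f)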

And then load it for inference:

import pickle
# load the encoder that was fitted on the training data
with open("onehot.pickle", "rb") as f:
    loaded_enc = pickle.load(f)

Then transform the test data with it:

#X_test is the source column of your test data
X_test = loaded_enc.transform(X_test)

In general, once the encoder has been fitted on X_train, you only need to call transform (not fit_transform) on the test set. The output then always has the columns learned from the training data:

X_test = loaded_enc.transform(X_test)
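
As a minimal, self-contained sketch of the idea above (skipping the pickle step), the snippet below fits the encoder on the 'Source' values from the question and transforms a small made-up batch that has no 'b' rows. The array `X_new` is hypothetical; `.toarray()` is used only to turn the sparse result into a dense array for inspection.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# 'Source' column from the question, as a 2-D array with one column
X_train = np.array([["a"], ["b"], ["c"], ["c"], ["a"], ["d"]])
# Hypothetical new data that contains no 'b' rows
X_new = np.array([["a"], ["c"], ["d"]])

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_train)

# transform() reuses the categories learned in fit(), so 'b' keeps its column
encoded_new = enc.transform(X_new).toarray()

print(enc.categories_[0])  # ['a' 'b' 'c' 'd']
print(encoded_new.shape)   # (3, 4)
```

The second column (for 'b') is all zeros in `encoded_new`, which is what the model expects when the category is simply absent.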
Gaussian Prior
  • The result of this is a sparse matrix. Can I get it into a data frame, since I have to join the numeric columns to these one-hot encoded columns? In this case, how do I join 'Amount' to this result and use it for training the model? – Sudhakar Samak Sep 15 '21 at 19:25
  • @SudhakarSamak, the encoder has a `sparse` option. You may also want to look into `ColumnTransformer`. – Ben Reiniger Sep 15 '21 at 21:17
2

Write the columns out explicitly:

import pandas as pd
import numpy as np

# An example of your dataframe with no "b" source
df = pd.DataFrame({
    "Amount": [int(i) for i in np.random.normal(800, 300, 10)],
    "Source": np.random.choice(["a", "c", "d"], 10),
    "y": np.random.choice([1, 0], 10),
})

# One-hot encoding by hand: one column per category seen in training
df["source_a"] = np.where(df.Source == "a", 1, 0)
df["source_b"] = np.where(df.Source == "b", 1, 0)  # all zeros here, but the column exists
df["source_c"] = np.where(df.Source == "c", 1, 0)
df["source_d"] = np.where(df.Source == "d", 1, 0)

Example output of the dataframe (the values are random, so yours will differ):

   Amount Source  y  source_a  source_b  source_c  source_d
0     685      d  0         0         0         0         1
1    1149      c  1         0         0         1         0
2    1220      a  0         1         0         0         0
3     834      c  0         0         0         1         0
4     780      c  0         0         0         1         0
5     502      a  0         1         0         0         0
6     191      c  1         0         0         1         0
7     637      c  0         0         0         1         0
8     701      d  0         0         0         0         1
9     941      c  1         0         0         1         0

As a general rule, dependencies should be minimized...
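
In the same pandas-only spirit, a shorter variant is sketched below. It is not part of the approach above, just one possible alternative: declare the full category list up front and let `pd.get_dummies` create one column per declared category. The name `SOURCE_CATEGORIES` and the `new` frame are made up for illustration.

```python
import pandas as pd

# Full list of categories, known from the training data
SOURCE_CATEGORIES = ["a", "b", "c", "d"]

# Hypothetical new data without any 'b' rows
new = pd.DataFrame({"Amount": [300, 900, 1200], "Source": ["a", "c", "d"]})

# A categorical dtype with fixed categories makes get_dummies emit a column
# for every declared category, including 'b', which never occurs here
source = new["Source"].astype(pd.CategoricalDtype(SOURCE_CATEGORIES))
dummies = pd.get_dummies(source, prefix="source", dtype=int)

new_encoded = pd.concat([new[["Amount"]], dummies], axis=1)
print(list(new_encoded.columns))
# ['Amount', 'source_a', 'source_b', 'source_c', 'source_d']
```

Unlike the encoder with `handle_unknown='ignore'`, a value outside `SOURCE_CATEGORIES` becomes missing after the `astype` call and gets zeros in every dummy column, so the category list has to be kept in sync with training by hand.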

Alejo