7

Suppose I have a location feature. In the training set its unique values are 'NewYork' and 'Chicago', but in the test set they are 'NewYork', 'Chicago' and 'London'. When creating the one-hot encoding, how do I ignore 'London'? In other words, how do I avoid encoding categories that appear only in the test set?

Neo

3 Answers

2

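You can use the handle_unknown parameter of OneHotEncoder.

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')

With this setting, a category that was not seen during fit does not raise an error at transform time. It is encoded as all zeros across the one-hot columns for that feature.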

See the documentation for more: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
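A minimal sketch of what this looks like with the data from the question. The variable names train, test and encoded are made up for illustration; the encoder is fitted on the training values only and then applied to the test values.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy data from the question: one feature column, one row per sample.
train = [['NewYork'], ['Chicago']]
test = [['NewYork'], ['Chicago'], ['London']]

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train)  # learns the categories from the training data only

# transform returns a sparse matrix by default; densify it for inspection.
encoded = ohe.transform(test).toarray()

print(ohe.categories_[0])  # learned categories, sorted: Chicago, NewYork
print(encoded)             # the 'London' row contains only zeros
```

The 'London' row gets no column of its own, so the encoded test matrix has the same width as the encoded training matrix.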

devansh
1

You usually don't want to throw information away; you want to capture it in your model instead. For example, you might have some data with NaN values:

train_data = ['NewYork', 'Chicago', np.nan]

Solution 1

You will likely have a way of dealing with this already; whether you impute, delete, etc. is up to you and depends on the problem. More often than not you can make NaN its own category, since the fact that a value is missing is information as well. Something like this can suffice:

# function to replace NA in categorical variables
def fill_categorical_na(df, var_list):
  X = df.copy()
  X[var_list] = df[var_list].fillna('Missing')
  return X

# replace missing values with new label: "Missing"
X_train = fill_categorical_na(X_train, vars_with_na)
X_test = fill_categorical_na(X_test, vars_with_na)

Then, when you move to production, you could write a script that pushes unseen categories into the "Missing" category you established earlier.
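A rough sketch of such a script with pandas. The column name location, the helper to_known and the toy frames are assumptions for illustration, not part of the code above.

```python
import pandas as pd

# Hypothetical toy frames; 'location' stands in for any categorical column.
X_train = pd.DataFrame({'location': ['NewYork', 'Chicago', None]})
X_test = pd.DataFrame({'location': ['NewYork', 'London', None]})

# Give missing training values their own label, then record what was seen.
X_train['location'] = X_train['location'].fillna('Missing')
known = set(X_train['location'])

def to_known(series, known, fallback='Missing'):
    """Keep values seen in training; send everything else to the fallback label."""
    return series.where(series.isin(known), fallback)

X_test['location'] = to_known(X_test['location'], known)
print(X_test['location'].tolist())
```

Here both the unseen 'London' and the missing test value end up as 'Missing', a label the model has already seen during training.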

Solution 2

If you're not satisfied with that idea, you could instead group these unusual cases into a new category that we'll call "Rare", because its members don't occur often.

train_data = ['NewYork', 'Chicago', 'NewYork', 'Chicago', 'London']

import numpy as np  # used by np.where below

# first, capture the categorical (object-dtype) variables
cat_vars = [var for var in X_train.columns if X_train[var].dtype == 'O']

def find_frequent_labels(df, var, rare_perc):
  df = df.copy()
  tmp = df.groupby(var)['Target_Variable'].count() / len(df)
  return tmp[tmp>rare_perc].index

for var in cat_vars:
  frequent_ls = find_frequent_labels(X_train, var, 0.01)
  X_train[var] = np.where(X_train[var].isin(frequent_ls), X_train[var], 'Rare')
  X_test[var] = np.where(X_test[var].isin(frequent_ls), X_test[var], 'Rare')

Now, given enough instances of the "normal" categories, London will get pushed into the "Rare" category. However many new categories show up, they will all be grouped into 'Rare', provided they remain rare and don't become dominant categories.
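The same idea at serving time, as a small pure-Python sketch. The function names frequent_labels and bucket, the 5% threshold and the toy training list are all made up for illustration.

```python
from collections import Counter

def frequent_labels(values, min_share):
    """Return the labels whose share of the training values exceeds min_share."""
    counts = Counter(values)
    total = len(values)
    return {label for label, n in counts.items() if n / total > min_share}

def bucket(value, frequent, fallback='Rare'):
    """Map any label outside the frequent set to the fallback category."""
    return value if value in frequent else fallback

# Hypothetical training column: 'Boston' occurs once in 100 rows, so it is rare.
train = ['NewYork'] * 60 + ['Chicago'] * 39 + ['Boston']
frequent = frequent_labels(train, 0.05)

train_bucketed = [bucket(v, frequent) for v in train]
served = [bucket(v, frequent) for v in ['NewYork', 'London', 'Toronto']]
print(served)
```

Because 'Boston' is bucketed during training, the model sees at least some 'Rare' rows, and never-before-seen values such as 'London' or 'Toronto' land in that same bucket when the model is served.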

kevin_theinfinityfund
  • Someone explain the downvote. This clearly addresses the question. – kevin_theinfinityfund Jan 20 '20 at 08:09
  • The problem with your answer is that you conveniently ignore that unseen categories remain unseen. It does not matter if you rename `London` into `NaN` or `rare`, and what encoding scheme you use. If every training example is either `NewYork` or `Chicago`, the model has no opportunity to learn what to do with other categories. – paperskilltrees Aug 24 '22 at 04:41
  • I’m saying that in the serving layer of whatever ML system you’re using, you’d always be filling in any non-frequent type as “rare”. In your training layer and model development stage you will have some small percentage of “rare” categories. Ex: if it’s mean imputation, let’s say “rare”=0.1. During model serving you get a never-before-seen “Toronto.” Your serving layer will map “Toronto” -> “rare” -> 0.1. – kevin_theinfinityfund Aug 25 '22 at 05:49
  • Fair enough! I guess this would be the normal real-world scenario. Still, it is possible OP had a toy, academic or very special example in which this is not the case. One could probably synthesize the "rare" category by sampling from other categories, or by hard-coding the train mean into the imputer for unseen categories. – paperskilltrees Aug 25 '22 at 06:01
-1

Assuming these are your lists:

train_data = ['NewYork', 'Chicago']
test_set = ['NewYork', 'Chicago', 'London']

Based on your question:

How not to encode the categories that only appear in the test set?

# keep only the test values that also occur in the training data
for each in test_set:
    if each in train_data:
        print(each)

This prints NewYork and Chicago, which means London is skipped.

Jaimin Ajmeri