I have run into this issue, and I couldn't find a solution through scikit-learn.
I am using pandas .get_dummies() to do something similar to OneHotEncoder.
Below is a function I wrote to deal with this exact issue. Feel free to use it and improve it, and please let me know if you find any errors (I adapted it from a more specific function I had in my codebase):
import numpy as np
import pandas as pd


def one_hot_encoding_fixed_columns(pandas_series, fixed_columns):
    # Creates complete fixed columns list (with nan and 'other')
    fixed_columns = list(fixed_columns)
    fixed_columns.extend([np.nan, 'other'])
    # Get dummies dataset
    ohe_df = pd.get_dummies(pandas_series, dummy_na=True)
    # Create blank 'other' column
    ohe_df['other'] = 0
    # Check if columns created by get_dummies() are in 'fixed_columns' list.
    for column in ohe_df.columns:
        # The NaN column is skipped explicitly, because NaN labels do not
        # always compare equal in a list membership test.
        if column not in fixed_columns and not pd.isna(column):
            # If not in 'fixed_columns', transforms exceeding column into 'other'.
            ohe_df['other'] = ohe_df['other'] + ohe_df[column]
            # drop() returns a new DataFrame, so the result is assigned back.
            ohe_df = ohe_df.drop(columns=[column])
    # Check if elements in 'fixed_columns' are in the df generated by get_dummies()
    for column in fixed_columns:
        if column not in ohe_df.columns:
            # If the element is not present, create a new column with all values set to 0.
            ohe_df[column] = 0
    # Reorders columns according to fixed columns
    ohe_df = ohe_df[fixed_columns]
    return ohe_df
Basically, you create a list of columns that will always be used. If the test sample doesn't have any elements of a given category, a corresponding column with all values = 0 is created. If the test sample has a new value that wasn't in the train sample, it is categorized as other.
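To make the two cases concrete, here is a small sketch of what plain get_dummies() does on its own. The train and test series are made-up toy data, not from my codebase: the test sample is missing 'b' and 'c' and contains an unseen 'd', so the two encodings end up with different columns.

```python
import pandas as pd

# Toy data (hypothetical): 'd' never appears in train, 'b' and 'c' never appear in test
train = pd.Series(["a", "b", "c", "a"])
test = pd.Series(["a", "d"])

# Encoding each sample separately gives each one its own set of columns
train_cols = list(pd.get_dummies(train).columns)
test_cols = list(pd.get_dummies(test).columns)

print(train_cols)  # ['a', 'b', 'c']
print(test_cols)   # ['a', 'd']
```

A model fitted on the train columns can't consume the test frame as it is, which is the mismatch the fixed column list is meant to remove.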
I have commented the code and I hope it is understandable. If you have any questions, just let me know and I'll clarify.
The input to this function is pandas_series = df['column_name'], and you could do something like fixed_columns = df[selected_column].str[0].value_counts().index.values on the training set to generate the values that will also be used on the test set.
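As a rough end-to-end illustration of that workflow, the sketch below takes the category list from the training sample only and then applies it to both samples. Note that encode_with_reindex is a hypothetical, shorter helper built on DataFrame.reindex, not the function above, and for brevity it leaves out the NaN column. The toy series are made up as well.

```python
import pandas as pd


def encode_with_reindex(series, fixed_columns):
    # Hypothetical compact variant; it does not create a NaN column.
    known = list(fixed_columns)
    # Values not seen in training (except missing ones) are relabelled 'other'
    cleaned = series.where(series.isin(known) | series.isna(), "other")
    dummies = pd.get_dummies(cleaned, dtype=int)
    # Add absent categories as all-zero columns and enforce a single column order
    return dummies.reindex(columns=known + ["other"], fill_value=0)


# Toy data (hypothetical)
train = pd.Series(["a", "b", "c", "a"])
test = pd.Series(["a", "d"])

# The fixed column list comes from the training sample only
fixed_columns = train.value_counts().index.values

train_ohe = encode_with_reindex(train, fixed_columns)
test_ohe = encode_with_reindex(test, fixed_columns)

# Both frames now share the same columns in the same order
print(list(train_ohe.columns) == list(test_ohe.columns))
print(test_ohe)
```

Here the unseen 'd' in the test sample lands in the 'other' column, and 'b' and 'c' show up as all-zero columns.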