
I am trying to use a high-cardinality feature (siteid) in a scikit-learn model and am using get_dummies to one-hot encode it. I get around 800 new binary columns, which gives decent accuracy with logistic regression. My problem is that when I pass a new dataset through my model, this feature has a different cardinality, say 300 unique values, and the model rightly asks: where are the other 500 columns you trained me on? How can I resolve this?

I don't want to have to retrain the model every time the cardinality changes, nor do I want to hard-code these columns in my SQL data load.

cat_columns = ["siteid"]

df = pd.get_dummies(df, prefix_sep="__", columns=cat_columns)
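
A minimal sketch of the mismatch (toy siteid values assumed, three in training and two at scoring time):

```python
import pandas as pd

# Training data: siteid takes several values (3 here for brevity)
train = pd.DataFrame({"siteid": ["a", "b", "c"]})
train_enc = pd.get_dummies(train, prefix_sep="__", columns=["siteid"])

# New data: only a subset of siteids appears
new = pd.DataFrame({"siteid": ["a", "b"]})
new_enc = pd.get_dummies(new, prefix_sep="__", columns=["siteid"])

# The encoded frames no longer share the same columns, so a model
# fitted on train_enc cannot score new_enc directly.
missing = set(train_enc.columns) - set(new_enc.columns)
print(missing)  # {'siteid__c'}
```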
Jan

2 Answers


My recommendation would be to pad the remaining columns with zeros. So if your new sample has, for example, 10 unique values and the model expects 50 (total_cols), then create 40 columns of zeros on the right to "fill out" the rest of the data:

import numpy as np
import pandas as pd

df = pd.DataFrame({"siteid": range(10)})
cat_columns = ["siteid"]
df1 = pd.get_dummies(df, columns=cat_columns)

# df1 has shape (10, 10)

total_cols = 50    # Number of columns that model expects
zero_padding = pd.DataFrame(np.zeros((df1.shape[0], total_cols - df1.shape[1])))
df = pd.concat([df1, zero_padding], axis=1)
df.columns = ["siteid__" + str(i) for i in range(df.shape[1])]

# df now has shape (10, 50)
Ted
  • Thanks, this is useful, but my feature may have a higher cardinality than my original training dataset in the future, and then this won't work with my model. What I'm really looking for is a way to pass this feature as a single column that my model won't interpret as continuous integers. – Classic123456 Aug 28 '19 at 13:11
  • @Classic123456 If you pass new data to your model with columns that it has never seen before, then it won't know what to do with these new columns. – Ted Aug 28 '19 at 13:14
  • Yes, good point. I suppose I wanted to ignore new columns until I retrain the model, and I didn't want it to break if I pass unfamiliar columns. @Philip Martin's solution does that. – Classic123456 Aug 29 '19 at 08:08
  • @Classic123456 Ok I see. Yes his answer is very useful. – Ted Aug 29 '19 at 08:16

I suggest using scikit-learn's OneHotEncoder.

documentation here

In your case, the usage would look something like

from sklearn.preprocessing import OneHotEncoder

cat_columns = ["siteid"]

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(df[cat_columns])

categories = [cat for cats in enc.categories_ for cat in cats]

# transform returns a sparse matrix, so densify before assigning
df[categories] = enc.transform(df[cat_columns]).toarray()

The handle_unknown parameter is key, and you must keep the fitted enc object around so new data is encoded consistently.

On new dataframes you would run

df_new[categories] = enc.transform(df_new[cat_columns]).toarray()

This will one-hot encode the same categories and ignore any new ones that your model has not seen.
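
A minimal self-contained sketch of that behaviour (toy siteid values assumed): an unseen value at transform time simply encodes to an all-zero row instead of raising.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cat_columns = ["siteid"]
df = pd.DataFrame({"siteid": ["a", "b", "c"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(df[cat_columns])

# New data containing an unseen siteid "d"
df_new = pd.DataFrame({"siteid": ["a", "d"]})
out = enc.transform(df_new[cat_columns]).toarray()
print(out)
# [[1. 0. 0.]
#  [0. 0. 0.]]   <- "d" was never seen, so it encodes to all zeros
```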

Phillip Martin
  • I'm afraid I could not get this code to work. I have since resolved this using the get_dummies technique described here: https://blog.cambridgespark.com/robust-one-hot-encoding-in-python-3e29bfcec77e – Classic123456 Aug 29 '19 at 13:52
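
For reference, a sketch of the reindex-based get_dummies approach that post describes (toy siteid values assumed): remember the training dummy columns, then align every new frame to them, filling missing dummies with 0 and dropping unseen ones.

```python
import pandas as pd

cat_columns = ["siteid"]

# Fit time: remember the full dummy column list
train = pd.DataFrame({"siteid": ["a", "b", "c"]})
train_enc = pd.get_dummies(train, prefix_sep="__", columns=cat_columns)
train_cols = train_enc.columns

# Score time: encode, then align to the training columns.
# Missing dummies are filled with 0; unseen ones ("d") are dropped.
new = pd.DataFrame({"siteid": ["a", "d"]})
new_enc = pd.get_dummies(new, prefix_sep="__", columns=cat_columns)
new_enc = new_enc.reindex(columns=train_cols, fill_value=0)

print(list(new_enc.columns))  # ['siteid__a', 'siteid__b', 'siteid__c']
```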