I am trying to use a high cardinality feature (siteid) in a sci-kit learn model and am using get_dummies to one-hot encode this feature. I get around 800 new binary columns which returns a decent accuracy using logistic regression. My problem is that when I pass a new dataset through my model I have a different cardinality on this feature with say 300 unique values and the model rightly asks, where are the other 500 columns you trained me on? How can I resolve this?
I don't want to have to train the model every time the cardinality changes nor do i want to hard code these columns in my SQL data load.
cat_columns = ["siteid"]
df = pd.get_dummies(df, prefix_sep="__",
columns=cat_columns)