My intention is to create a scikit-learn Isolation Forest model and regularly run new incoming data through it (the model is saved to and loaded from a .pickle file). My issue is that I have one-hot encoded my categorical features, which have high cardinality, so the dimensionality differs between `n_features_in_` and the unseen data passed to the `predict()` and `score_samples()` methods.
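To make the root cause concrete, here is a minimal illustration (assuming the encoding is done per batch with something like `pd.get_dummies`; that exact call and the `color` column are placeholders): two batches containing different category values produce different dummy columns.

import pandas as pd

# Two batches with the same schema but different category values present:
batch_a = pd.DataFrame({"color": ["red", "blue"]})
batch_b = pd.DataFrame({"color": ["red", "green"]})

print(pd.get_dummies(batch_a).columns.tolist())  # ['color_blue', 'color_red']
print(pd.get_dummies(batch_b).columns.tolist())  # ['color_green', 'color_red']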
The first time the script runs, I split the data into `testing_data` and `training_data` dataframes and then account for the missing features (most of the columns missing from one frame but present in the other are dummy features produced from one or more of the categorical variables):
# Align both frames to the union of their columns, filling any
# missing (mostly dummy) columns with 0.
col_list = list(set().union(testing_data.columns, training_data.columns))
training_data = training_data.reindex(columns=col_list, fill_value=0)
testing_data = testing_data.reindex(columns=col_list, fill_value=0)
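The rest of the first run then fits on the aligned training frame and pickles the model, roughly like this (simplified; the file path and estimator parameters are placeholders):

import pickle
from sklearn.ensemble import IsolationForest

# Fit on the column-aligned training frame and persist the fitted
# model for the next scheduled run ("iforest.pickle" is a placeholder path).
model = IsolationForest(random_state=42)
model.fit(training_data)
with open("iforest.pickle", "wb") as f:
    pickle.dump(model, f)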
On subsequent executions of this script via a CRON schedule, I would like to load the fitted model's .pickle file (saved by the previous execution) and use it to run further `predict()` and `score_samples()` calls on the new data. I run into the same issue, where the number of features in the new `testing_data` does not equal `n_features_in_`, but I can no longer use the lines of code above, since the original training frame is not around to take the column union against.
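The failing CRON run looks roughly like this (again simplified; the paths and the `read_csv`/`get_dummies` steps are placeholders for my actual pipeline):

import pickle
import pandas as pd

# Load the model fitted by the previous run ("iforest.pickle" is a
# placeholder path).
with open("iforest.pickle", "rb") as f:
    model = pickle.load(f)

# One-hot encode the new batch; its dummy columns rarely match the
# training columns exactly.
testing_data = pd.get_dummies(pd.read_csv("new_data.csv"))

# Both calls raise a ValueError along the lines of:
#   X has N features, but IsolationForest is expecting M features as input.
scores = model.score_samples(testing_data)
predictions = model.predict(testing_data)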
Any suggestions on this specific issue, or on ways to perform CRON-based continuous training and reuse of pickled classifiers?