
My intention is to create an SKLearn Isolation Forest model and regularly run predictions on new incoming data with this model (saved to and loaded from a .pickle file). My issue is that I have one-hot encoded my high-cardinality categorical features, so the dimensionality of the unseen data passed to the predict() and score_samples() methods differs from the model's n_features_in_ value.

On the first run of the script, I split the data into testing_data and training_data dataframes and then account for the features missing from one or the other (most of the missing features are dummy columns generated from one or more of the categorical variables):

# align both dataframes to the union of their columns, filling missing dummy columns with 0
col_list = list(set().union(testing_data.columns, training_data.columns))
training_data = training_data.reindex(columns=col_list, fill_value=0)
testing_data = testing_data.reindex(columns=col_list, fill_value=0)

On subsequent CRON-scheduled executions of the script, I would like to load the fitted model from the .pickle file saved by a previous run and call predict() and score_samples() on new data. I run into the same issue, where the number of features in the new testing_data does not equal n_features_in_, but at that point I can no longer use the lines of code above because the original training data is not available.
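Roughly, the scheduled run looks like this (the file names and the data-loading step are only illustrative):

import pickle
import pandas as pd

# load the model fitted and saved by a previous run
with open('isolation_forest.pickle', 'rb') as f:
    model = pickle.load(f)

# new incoming data, one-hot encoded the same way as before
testing_data = pd.get_dummies(pd.read_csv('new_data.csv'))

# fails whenever testing_data.shape[1] != model.n_features_in_
predictions = model.predict(testing_data)
scores = model.score_samples(testing_data)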

Any suggestions on this specific issue, or on ways to perform CRON-based continuous training and reuse of "pickled" classifiers?

  • edit: I read the link below and found that using sklearn's OneHotEncoder is better than pd.get_dummies(), but this would still only help me during the first execution. Beyond that I do not have "access" to the training data that was used to fit the model I unpickled. https://datascience.stackexchange.com/questions/18956/different-number-of-features-in-train-vs-test – MaJunior Jul 22 '22 at 19:56
  • To make sure I understand, there are categorical variable values in your test data that are not in your train data, and this changes the size of your input test dataframe to the model? For example, your training data may have values 'a' and 'b', but then your test data has a new value 'c'? – Karmen Jul 22 '22 at 20:19
  • Karmen, yes, you are correct. When I load the already-fitted model from the pickle file, I don't have a way to "normalize" the columns together. My thought was to also pickle the list of column names used to fit the model on each run of the script, but this idea would fail if the new testing data contained a dummied categorical value that isn't in that column list or in the data the model was fitted on. Hope that makes sense. – MaJunior Jul 22 '22 at 20:34

1 Answer


If a categorical value is not present in your training data, your model will not be trained to use it effectively on your test data.

To ignore values that are not in your training data, you can save your OneHotEncoder along with the model and apply the same encoder to your test data.

For example:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_df = pd.DataFrame(['a', 'b'], columns=['category'])
test_df = pd.DataFrame(['a', 'b', 'c'], columns=['category'])

# fit the encoder on the training data only; unknown values will be ignored at transform time
ohe = OneHotEncoder(handle_unknown='ignore').fit(train_df)

# transform the test data; the unseen value 'c' is encoded as an all-zero row
test_transformed = ohe.transform(test_df).toarray()

print(test_transformed)

[[1. 0.]
 [0. 1.]
 [0. 0.]]
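To make this work across your CRON runs, one option (a minimal sketch building on the example above; the file name and IsolationForest settings are just placeholders) is to pickle the fitted encoder together with the model and reload both each time:

import pickle
from sklearn.ensemble import IsolationForest

# first run: encode the training data, fit the model, and persist both objects
X_train = ohe.transform(train_df).toarray()
model = IsolationForest(random_state=0).fit(X_train)

with open('iforest_artifacts.pickle', 'wb') as f:
    pickle.dump({'encoder': ohe, 'model': model}, f)

# later CRON run: reload, apply the same encoding, then predict/score
with open('iforest_artifacts.pickle', 'rb') as f:
    artifacts = pickle.load(f)

X_new = artifacts['encoder'].transform(test_df).toarray()
predictions = artifacts['model'].predict(X_new)
scores = artifacts['model'].score_samples(X_new)

Because the encoder was fitted only on the training data, X_new always has the same number of columns the model was trained on, so the n_features_in_ mismatch goes away.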

There are some other suggestions in this thread: How to handle unseen categorical values in test data set using python?

I would recommend logging the new category values, though, as you may need to retrain your model to handle these new cases.
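One way to do that (a sketch, assuming the ohe and test_df from the example above) is to compare the incoming values against the encoder's fitted categories_ before transforming:

# categories_ holds, for each input column, the values seen during fit
known_values = set(ohe.categories_[0])
unseen_values = set(test_df['category']) - known_values
if unseen_values:
    print(f"Unseen category values: {unseen_values}")  # replace with real logging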

  • Thank you very much for your answer and your time. I will consider this. – MaJunior Jul 22 '22 at 21:32
  • For my anomaly detection situation, I would like to perform some action on handle_unknown, since the presence of a new category value is itself anomalous. For instance, if a new port number is seen in testing that hasn't been seen by the trained model, then instead of ignoring it or erroring out, I would like to, say, explicitly label that sample as anomalous. – MaJunior Jul 24 '22 at 15:24
  • @MaJunior that sounds like a new question; it's worth searching whether it has been asked already and, if not, opening a new one. Alternatively, you can update the title and description of this question and include as detailed a code example as you can. – Karmen Jul 25 '22 at 10:40