I am trying to build an isolation forest with scikit-learn and Python to detect anomalies. I have attached an image of what the data may look like, and I am trying to predict 'pages' based on several 'size' features.

When I print(anomaly), every single row is flagged as -1, an anomaly. Is this because I am only using 'size2' to classify them? Is there a way to use multiple columns to help detect the anomalies? Should I be making n_features equal to the number of columns I am using? Thank you so much for your help.

from sklearn.ensemble import IsolationForest

# df holds the 'pages' and 'size' columns described above
model = IsolationForest(n_estimators=100, max_samples='auto', contamination='auto')
model.fit(df[['pages']])
df['size2'] = model.decision_function(df[['pages']])
df['anomaly'] = model.predict(df[['pages']])
print(df.head(50))
anomaly = df.loc[df['anomaly'] == -1]
anomaly_index = list(anomaly.index)
print(anomaly)

1 Answer

I'm not sure an isolation forest is appropriate here. If you want to predict pages values from the size data, you would be better off using either a regression model or a classifier (I can't tell from the data shown whether pages is categorical).

That said, if you do want to do anomaly detection, you have to fit your model on the same subset of features you use for prediction. Detecting anomalies based on the size features would look something like this:

df['anomaly'] = model.fit_predict(df[['size2', 'size3', 'size4']])

Any subset of columns can be chosen to train the model on, but calls to both fit and predict must be made with the same feature set.
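
For a fuller picture, here is a minimal sketch of fitting and predicting on the same feature set. The data is synthetic and made up purely for illustration (only the column names come from the question), and the planted outlier row and the `features` list are my own assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the asker's data: column names follow the question,
# the values are invented for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "size2": rng.normal(100, 5, 200),
    "size3": rng.normal(50, 3, 200),
    "size4": rng.normal(10, 1, 200),
})
# Plant one obvious outlier in the last row.
df.loc[199, ["size2", "size3", "size4"]] = [500.0, 300.0, 90.0]

features = ["size2", "size3", "size4"]
model = IsolationForest(n_estimators=100, contamination="auto", random_state=0)

# Fit and predict with the same columns.
model.fit(df[features])
df["anomaly"] = model.predict(df[features])  # 1 = inlier, -1 = outlier

print(df["anomaly"].value_counts())
print(df.loc[df["anomaly"] == -1].tail())
```

As far as I know, `IsolationForest` has no `n_features` argument to set; the number of features is taken from the columns you pass to `fit`.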

In the code given, the model is trained on the label column but used to predict outliers based on the pages column. Although the label column isn't shown, if its values are substantially different from those in the pages column, it's not surprising that every row would be categorized as an outlier. In addition, as written, the size2 column is not used as a feature for prediction; it is overwritten by the decision function scores for the pages column.
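
To avoid overwriting a feature, one option is to store the decision function scores in a column of their own. The sketch below assumes a hypothetical `score` column name and uses synthetic values again; it is only meant to illustrate the idea, not to reproduce the original data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic data again; only the column names size2/size3 come from the question.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "size2": rng.normal(100, 5, 300),
    "size3": rng.normal(50, 3, 300),
})
size2_before = df["size2"].copy()  # kept only to show size2 is left untouched

features = ["size2", "size3"]
model = IsolationForest(random_state=1).fit(df[features])

# Write scores and labels to new columns instead of overwriting a feature.
df["score"] = model.decision_function(df[features])  # negative = more anomalous
df["anomaly"] = model.predict(df[features])          # -1 where score < 0

# Most anomalous rows first.
print(df.sort_values("score").head())
```

Sorting by `score` also gives you a ranking, which can be more useful than the hard -1/1 labels when you are unsure what contamination level to expect.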

Charles Gleason
  • Thank you so much! The 'label' should actually be 'pages', sorry for the confusion! –  Jun 11 '20 at 20:41
  • When I added `df['anomaly'] = df.fit_predict(df[['size2', 'size3', 'size4']])`, I got the error `AttributeError: 'DataFrame' object has no attribute 'fit_predict'`. Do you know why this is? Thanks! –  Jun 11 '20 at 20:51
  • Yes; sorry about that! Should have been `model.fit_predict`, it's fixed now. – Charles Gleason Jun 11 '20 at 21:11
  • Thank you so much! –  Jun 11 '20 at 21:17