
A colleague and I are trying to detect anomalies in a large dataset. We want to try out different algorithms (LOF, OC-SVM, DBSCAN, etc.), but we are currently working with IsolationForest.

Our dataset is currently shaped as follows. It is a count of the number of event types logged per user per day, and contains more than 300,000 records:

date      user    event   count
6/1/2021  user_a  Open    2
6/2/2021  user_a  Open    4
6/1/2021  user_b  Modify  3
6/2/2021  user_b  Open    5
6/2/2021  user_b  Delete  2
6/3/2021  user_b  Open    7
6/5/2021  user_b  Move    4
6/4/2021  user_c  Modify  3
6/4/2021  user_c  Move    6

Our goal is to automatically detect anomalous counts of events per user. For example, for a user who normally logs between 5 and 10 "Open" events per day, a count of 400 would be an outlier. My colleague and I disagree on how we should prepare the dataset for the IsolationForest algorithm.

One of us says we should drop the date field and label-encode the rest of the data, i.e. replace all strings with integers and let IsolationForest calculate an outlier score for each record.
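For context, a minimal sketch of what that first approach would look like with scikit-learn (the toy frame below just mirrors the first rows of the table above; this is an illustration of the proposal, not an endorsement):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame mirroring the first few records of the dataset above
df = pd.DataFrame({
    "date":  ["6/1/2021", "6/2/2021", "6/1/2021"],
    "user":  ["user_a", "user_a", "user_b"],
    "event": ["Open", "Open", "Modify"],
    "count": [2, 4, 3],
})

# Drop the date field and replace every string column with integer codes
encoded = df.drop(columns=["date"]).copy()
for col in ["user", "event"]:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

# LabelEncoder assigns codes alphabetically:
# user_a -> 0, user_b -> 1; Modify -> 0, Open -> 1
```

Note that LabelEncoder imposes an arbitrary (alphabetical) ordering on the categories, which is exactly the point of contention below.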

The other is of the opinion that label encoding should NOT be done, since categorical data cannot meaningfully be replaced by integers. Instead, the data should be scaled, the user column should be dropped (or set as the index), and the event column should be pivoted to generate more dimensions (the example below shows what he wants to do):

date      user    event_Open  event_Modify  event_Delete  event_Move
6/1/2021  user_a  2           NaN           NaN           NaN
6/2/2021  user_a  4           NaN           NaN           NaN
6/1/2021  user_b  NaN         3             NaN           NaN
6/2/2021  user_b  5           NaN           2             NaN
6/3/2021  user_b  7           NaN           NaN           NaN
6/5/2021  user_b  NaN         NaN           NaN           4
6/4/2021  user_c  NaN         3             NaN           6
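For reference, the wide format above can be produced with a pandas pivot (a sketch on a toy frame mirroring the table; here the NaNs are filled with 0, on the assumption that a missing count means no events):

```python
import pandas as pd

# Toy frame mirroring part of the dataset above
df = pd.DataFrame({
    "date":  ["6/1/2021", "6/2/2021", "6/1/2021", "6/2/2021", "6/2/2021"],
    "user":  ["user_a", "user_a", "user_b", "user_b", "user_b"],
    "event": ["Open", "Open", "Modify", "Open", "Delete"],
    "count": [2, 4, 3, 5, 2],
})

# One row per (date, user); one column per event type; 0 where no events
wide = (df.pivot_table(index=["date", "user"], columns="event",
                       values="count", fill_value=0)
          .add_prefix("event_")
          .reset_index())
```

The resulting columns are sorted alphabetically (event_Delete, event_Modify, event_Open), so the column order differs slightly from the hand-written table above.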

So we're in disagreement on a couple of points. I'll list them below and include my thoughts on them:

Label encoding: is a must and does not affect the categorical nature of the dataset.
Scaling: IsolationForest is by nature insensitive to scaling, making scaling superfluous.
Drop date column: the date is not actually a feature in the dataset, as it has no correlation with the anomalousness of the count per event type per user.
Drop user column: user is actually a (critical) feature and should not be dropped.
Pivot event column: this generates a sparse matrix, which can be bad practice. It also introduces relations within the data that do not exist in reality (for example, on June 2 user_b logged 5 Open events and 2 Delete events, but these are considered unrelated and should therefore not form a single record).

I am very curious about your thoughts on these points. What is best practice regarding the issues listed above when using the IsolationForest algorithm for anomaly detection?

timDS
  • You make claims that user is critical and that time (date/day-of-week etc.) is not. However, it does not seem that this is backed by any data analysis. Do the exploratory data analysis, and you will both learn what is true for this particular dataset and problem – Jon Nordby Aug 01 '21 at 12:11
  • Thank you for your answer Jon, it's highly appreciated. Exploratory data analysis is a must when trying to solve any data-science-related issue. How about the more technical part of my question? Should we scale the data, or is IF insensitive to scaling? Is label-encoding the data OK, or is that problematic? – timDS Aug 02 '21 at 09:03
  • Pivoting the event counts into columns is a good idea, if one replaces the NaNs with 0 (assuming that a missing count means no events). – Jon Nordby Aug 02 '21 at 12:26
  • Scaling is not useful with IsolationForest – Jon Nordby Aug 02 '21 at 12:28
  • In the future, you should make sure that you have one question per SO question. And things that are not so programming-related might be better on the Stats or Data Science Stack Exchange – Jon Nordby Aug 02 '21 at 12:29
  • Thank you very much for your answers. I will take your advice to heart and in the future will ask only one question per post on Stack Overflow. – timDS Aug 03 '21 at 12:01

0 Answers