
A colleague and I are trying to detect anomalies in a large dataset. We want to try out different algorithms (LOF, OC-SVM, DBSCAN, etc.), but we are currently working with IsolationForest.

Our dataset is currently shaped as follows. It is a count of the number of event types logged per user per day, and contains more than 300,000 records:

date      user    event   count
6/1/2021  user_a  Open    2
6/2/2021  user_a  Open    4
6/1/2021  user_b  Modify  3
6/2/2021  user_b  Open    5
6/2/2021  user_b  Delete  2
6/3/2021  user_b  Open    7
6/5/2021  user_b  Move    4
6/4/2021  user_c  Modify  3
6/4/2021  user_c  Move    6

Our goal is to automatically detect anomalous counts of events per user. For example, for a user who normally logs between 5 and 10 "Open" events per day, a count of 400 would be an outlier. My colleague and I disagree on how we should prepare the dataset for the IsolationForest algorithm.

One of us says we should drop the date field and label-encode the rest of the data, i.e. replace all strings with integers and let IsolationForest calculate an outlier score for each record.
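For context, a minimal sketch of what that first approach would look like with scikit-learn (the toy frame below just mirrors the first rows of the table above; this is an illustration of the proposal, not an endorsement):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame mirroring the first few records of the dataset above
df = pd.DataFrame({
    "date":  ["6/1/2021", "6/2/2021", "6/1/2021"],
    "user":  ["user_a", "user_a", "user_b"],
    "event": ["Open", "Open", "Modify"],
    "count": [2, 4, 3],
})

# Drop the date field and replace every string column with integer codes
encoded = df.drop(columns=["date"]).copy()
for col in ["user", "event"]:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

# LabelEncoder assigns codes alphabetically:
# user_a -> 0, user_b -> 1; Modify -> 0, Open -> 1
```

Note that LabelEncoder imposes an arbitrary (alphabetical) ordering on the categories, which is exactly the point of contention below.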

The other is of the opinion that label encoding should NOT be done, since categorical data cannot meaningfully be replaced by integers. Instead, the data should be scaled, the user column should be dropped (or set as the index), and the event column should be pivoted to generate more dimensions (the example below shows what he wants to do):

date      user    event_Open  event_Modify  event_Delete  event_Move
6/1/2021  user_a  2           NaN           NaN           NaN
6/2/2021  user_a  4           NaN           NaN           NaN
6/1/2021  user_b  NaN         3             NaN           NaN
6/2/2021  user_b  5           NaN           2             NaN
6/3/2021  user_b  7           NaN           NaN           NaN
6/5/2021  user_b  NaN         NaN           NaN           4
6/4/2021  user_c  NaN         3             NaN           6
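For reference, the wide format above can be produced with a pandas pivot (a sketch on a toy frame mirroring the table; here the NaNs are filled with 0, on the assumption that a missing count means no events):

```python
import pandas as pd

# Toy frame mirroring part of the dataset above
df = pd.DataFrame({
    "date":  ["6/1/2021", "6/2/2021", "6/1/2021", "6/2/2021", "6/2/2021"],
    "user":  ["user_a", "user_a", "user_b", "user_b", "user_b"],
    "event": ["Open", "Open", "Modify", "Open", "Delete"],
    "count": [2, 4, 3, 5, 2],
})

# One row per (date, user); one column per event type; 0 where no events
wide = (df.pivot_table(index=["date", "user"], columns="event",
                       values="count", fill_value=0)
          .add_prefix("event_")
          .reset_index())
```

The resulting columns are sorted alphabetically (event_Delete, event_Modify, event_Open), so the column order differs slightly from the hand-written table above.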

So we're in disagreement on a couple of points. I'll list them below and include my thoughts on them:

Label encoding: is a must and does not affect the categorical nature of the dataset.
Scaling: IsolationForest is by nature insensitive to scaling, making scaling superfluous.
Drop date column: the date is not actually a feature in the dataset, as it has no correlation with the anomalousness of the count per event type per user.
Drop user column: user is actually a (critical) feature and should not be dropped.
Pivot event column: this generates a sparse matrix, which can be bad practice. It also introduces relations within the data that do not exist in reality (for example, on June 2 user_b logged 5 Open events and 2 Delete events, but these are considered unrelated and should therefore not form a single record).

I am very curious about your thoughts on these points. What is best practice regarding the issues listed above when using the IsolationForest algorithm for anomaly detection?

timDS
  • You make claims that user is critical and that time (date/day-of-week etc.) is not. However, it does not seem that this is backed by any data analysis. Do the exploratory data analysis, and you will both learn what is true for this particular dataset and problem – Jon Nordby Aug 01 '21 at 12:11
  • Thank you for your answer Jon, it's highly appreciated. Exploratory data analysis is a must when trying to solve any data-science-related issue. How about the more technical part of my question? Should we scale the data, or is IF insensitive to scaling? Is label-encoding the data OK, or is that problematic? – timDS Aug 02 '21 at 09:03
  • Pivoting the event counts into columns is a good idea, if one replaces the NaNs with 0 (assuming that a missing count means no events). – Jon Nordby Aug 02 '21 at 12:26
  • Scaling is not useful with IsolationForest – Jon Nordby Aug 02 '21 at 12:28
  • In the future, you should make sure that you have one question per SO question. And things that are not so programming-related might be better on the Stats or Data Science Stack Exchange – Jon Nordby Aug 02 '21 at 12:29
  • Thank you very much for your answers. I will take your advice to heart and in the future will ask only one question per post on Stack Overflow. – timDS Aug 03 '21 at 12:01

0 Answers