Categorical Embeddings in an Unsupervised Setting for Anomaly Detection

Question

Context: I am working on an unsupervised use case. The dataset I have has the following fields: TimeStamp, UserName, and eventName. For example: User A performed Event B at Timestamp C.

My objective is anomaly detection, i.e. if User A performs a new event C, decide whether this is an anomaly or not.

My hypothesis is that if I can learn embeddings for events, this will give me a good way to compare the new event C with the events User A has performed previously, and thus tell whether it is an anomaly.

Now, eventName is a categorical, long-tailed feature for most users (i.e. a few events occur in very large numbers, while most events occur very infrequently). The number of distinct eventNames is in the range 300-400, of which an average user might perform just 10 on a day-to-day basis.

Question: I cannot work out how to go about learning embeddings for the events in my sample space.

I will highly appreciate any guidance on how to model this problem.

Do let me know if I missed providing any information that might help.

You may want to model very common events separately from rare ones, especially if common actions obscure sequences of rarer actions. — Jon Nordby, Dec 04 '20 at 08:53
To give more weight to rare events, you can consider something like TF-IDF from natural language processing. — Jon Nordby, Dec 04 '20 at 08:54
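
As a rough illustration of the TF-IDF suggestion, here is a minimal sketch that treats each user's event history as a "document" and each eventName as a "term". That mapping, the helper names `event_idf` and `tfidf_profile`, and the toy logs are all assumptions made for the example; the unsmoothed `log(N / df)` form is only one of several common IDF variants.

```python
import math
from collections import Counter

def event_idf(events_by_user):
    """Inverse document frequency per event, with each user as one 'document'."""
    n_users = len(events_by_user)
    doc_freq = Counter()
    for events in events_by_user.values():
        doc_freq.update(set(events))  # count each event at most once per user
    return {ev: math.log(n_users / n) for ev, n in doc_freq.items()}

def tfidf_profile(events, idf):
    """TF-IDF weight of each event in one user's history."""
    counts = Counter(events)
    total = sum(counts.values())
    return {ev: (c / total) * idf[ev] for ev, c in counts.items()}

# Hypothetical toy logs: "login" is done by everyone, "delete_bucket" by one user.
logs = {
    "userA": ["login"] * 50 + ["read_file"] * 10 + ["delete_bucket"],
    "userB": ["login"] * 40 + ["read_file"] * 5,
    "userC": ["login"] * 60,
}
idf = event_idf(logs)
profile_a = tfidf_profile(logs["userA"], idf)
```

With this weighting, an event that every user performs gets an IDF of zero and drops out of the profile, while events seen for only a few users keep a positive weight per occurrence.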

score 0 · Answer 1 · answered Dec 03 '20 at 10:21

Start simple. Divide the data into suitable time intervals, for example 1 day, and compute basic statistics within each interval, such as how many events of each event type occurred. Visualize these statistics across users and time to get an idea of the patterns in your data. To compute an anomaly score, define a distance between the features of a given time period and the typical statistics. A basic starting point might be the Mahalanobis distance. Alternatively, try some simple anomaly detection algorithms such as IsolationForest or LocalOutlierFactor.
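
The steps above could be sketched roughly as follows, assuming NumPy is available. The helper names `daily_count_matrix` and `mahalanobis_scores`, the three event names, and the synthetic records are hypothetical and only serve the illustration. The pseudo-inverse is used because count features for rare events often make the covariance matrix singular.

```python
from collections import Counter
import numpy as np

def daily_count_matrix(records, event_types):
    """records: list of (day, user, event). Returns (keys, X), where
    X[i, j] is how often event_types[j] occurred for keys[i] = (day, user)."""
    counts = Counter(records)
    keys = sorted({(day, user) for day, user, _ in records})
    X = np.array([[counts[(d, u, ev)] for ev in event_types] for d, u in keys],
                 dtype=float)
    return keys, X

def mahalanobis_scores(X):
    """Mahalanobis distance of each row from the mean of all rows."""
    mu = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates singular covariance
    diff = X - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))

# Synthetic example: ten ordinary days, then one day with a never-seen event.
event_types = ["login", "read_file", "delete_bucket"]
logins = [5, 6, 4, 5, 7, 5, 6, 4, 5, 6]
reads = [2, 3, 2, 1, 2, 3, 2, 2, 1, 3]
records = []
for i, (n_login, n_read) in enumerate(zip(logins, reads), start=1):
    day = f"2020-12-{i:02d}"
    records += [(day, "userA", "login")] * n_login
    records += [(day, "userA", "read_file")] * n_read
records += [("2020-12-11", "userA", "login")] * 5
records += [("2020-12-11", "userA", "read_file")] * 2
records += [("2020-12-11", "userA", "delete_bucket")] * 4

keys, X = daily_count_matrix(records, event_types)
scores = mahalanobis_scores(X)
most_anomalous = keys[int(np.argmax(scores))]
```

In this toy setup the mean and covariance are estimated on the same rows that are scored, which limits how large a score can get on small samples; estimating them on a reference period and scoring new days against it is probably the more useful arrangement. The same matrix `X` could also be passed to scikit-learn's IsolationForest or LocalOutlierFactor.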

Only after this, consider more advanced approaches, such as modelling groups or sequences of events, or sub-population modelling of users, etc.

I already did some data visualizations at week, day, and hour granularity and observed the patterns. The problem with engineering features for each event type is that there are so many event types, and a lot of them happen rarely (but they are important, since I am doing anomaly detection). Hence I thought of embeddings for events or users to help with this. — Mohit Munjal, Dec 04 '20 at 04:55
