
I am new to the field of deep learning, and I would like to ask about using an autoencoder for anomaly detection on an unlabeled dataset. My confusion starts with the questions below:

1) Some posts say to separate the anomalies from the non-anomalies (assuming the data is labelled) and train the AE only on the non-anomaly samples (which usually dominate the dataset). So my question is: how am I going to separate my dataset if it is unlabeled?

2) If I train on the original unlabeled dataset, how do I detect the anomalous data?

CodeNameBobby

2 Answers


The labels do not go into the autoencoder.

An autoencoder consists of two parts, an encoder and a decoder.

Encoder: compresses the input, e.g. a sample with 784 features down to 50 features.

Decoder: reconstructs the original representation, i.e. the 784 features, from those 50 features.

Now, to detect an anomaly: if you pass in an unseen sample, it should be reconstructed close to the original without much loss. But if there is a lot of error in reconstructing it, it could be an anomaly.

[Figure: encoder/decoder architecture of an autoencoder. Picture credit: towardsdatascience.com]
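To make the encode/decode/reconstruction-error idea concrete, here is a toy numpy sketch (not from the answer). It uses a *linear* stand-in for an autoencoder, PCA via SVD, where the encoder is a projection onto the top-k components and the decoder is its transpose; a real AE learns nonlinear versions of both maps. Dimensions are scaled down from the 784 → 50 in the answer, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" samples living near a 5-dimensional subspace of 20 dims
# (a stand-in for the structure an AE would learn from real normal data).
n, d, k = 500, 20, 5
basis = rng.normal(size=(k, d))
normal = rng.normal(size=(n, k)) @ basis + 0.01 * rng.normal(size=(n, d))

# Linear "autoencoder" via SVD: encoder projects onto the top-k right
# singular vectors, decoder maps the k codes back to d features.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)

def encode(x):
    return (x - mean) @ vt[:k].T   # d features -> k codes

def decode(z):
    return z @ vt[:k] + mean       # k codes -> d features

def reconstruction_error(x):
    return np.sum((x - decode(encode(x))) ** 2, axis=-1)

# Normal-looking samples reconstruct almost perfectly; points far off the
# learned subspace (our stand-in anomalies) do not.
err_normal = reconstruction_error(normal).mean()
anomalies = 3.0 * rng.normal(size=(10, d))
err_anomaly = reconstruction_error(anomalies).mean()
print(err_normal < err_anomaly)  # True
```

The same logic carries over to a trained AE: replace `encode`/`decode` with the learned networks and keep the reconstruction-error check unchanged.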

  • I understand that the labels do not go in as AE input. My question is: say I have a dataset with a column of 0/1 labels, so I know what my labels are. Some posts say to split the dataset by that label into two sets, one with label 0 and one with label 1, and then, after removing the label column, train the AE only on the label-0 set. Then, as you mentioned, unseen data (label 1) will have a higher reconstruction error. So my doubt is: how do I segregate the data (in preprocessing) if the dataset is unlabeled? – CodeNameBobby Jun 15 '19 at 08:10
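For the labelled case described in this comment, the split is a one-liner. The sketch below uses a tiny hypothetical array whose last column is the 0/1 label (the column names and values are made up for illustration):

```python
import numpy as np

# Hypothetical labelled dataset: last column is the 0/1 anomaly label.
data = np.array([
    [0.1, 0.2, 0],
    [0.0, 0.3, 0],
    [5.0, 4.0, 1],   # the rare labelled anomaly
    [0.2, 0.1, 0],
])

features, labels = data[:, :-1], data[:, -1]
train_normal = features[labels == 0]    # label-0 rows, label column dropped -> AE training set
held_out_anomalies = features[labels == 1]  # kept aside to check reconstruction error
print(train_normal.shape)  # (3, 2)
```

Without labels this split is impossible, which is exactly what the second answer below addresses: if anomalies are rare enough, you can skip the split entirely.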

I think you partly answered the question yourself: by definition, an anomaly is a rare event. So even if you don't know the labels, your training data will contain only very few such samples, and the autoencoder will predominantly learn what the data usually looks like. Both during training and at prediction time, the reconstruction error will therefore be large for an anomaly. And since such examples come up only very seldom, they will not influence your embedding much.

In the end, if you can really justify that the anomaly you are checking for is rare, you might not need much pre-processing or labelling. If it occurs more often (a threshold is hard to give, but I'd say the rate should be <<1%), your AE might pick up on that signal, and you would really have to get the labels in order to split the data. But then again: that would not be an anomaly any more, right? In that case you could go ahead and train a (balanced) classifier on the data.
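One practical way to act on this answer, with no labels at all, is to set the anomaly threshold from the error distribution itself. This sketch uses synthetic stand-in numbers for the reconstruction errors a trained AE would produce (the gamma parameters and the 99th-percentile cutoff are illustrative assumptions, not from the answer):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in reconstruction errors on the raw, unlabeled data: mostly small
# (normal samples) plus a rare heavy tail (the <<1% anomalies).
errors = np.concatenate([
    rng.gamma(shape=2.0, scale=0.5, size=990),          # ~99% normal
    rng.gamma(shape=2.0, scale=0.5, size=10) + 20.0,    # rare anomalies
])

# With no labels, pick the threshold from the distribution itself, e.g. the
# 99th percentile; the choice encodes your assumed anomaly rate.
threshold = np.quantile(errors, 0.99)
flagged = errors > threshold
print(int(flagged.sum()))  # 10 -- exactly the heavy-tail samples
```

The percentile is a knob, not a law: set it from whatever prior you have on how rare the anomaly really is, and validate it on any labelled examples you can get.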

Baradrist