35

I have a scenario where I have several thousand instances of data. The data itself is represented as a single integer value. I want to be able to detect when an instance is an extreme outlier.

For example, with the following example data:

a = 10
b = 14
c = 25
d = 467
e = 12

d is clearly an anomaly, and I would want to perform a specific action based on this.

I was tempted to just try an use my knowledge of the particular domain to detect anomalies. For instance, figure out a distance from the mean value that is useful, and check for that, based on heuristics. However, I think it's probably better if I investigate more general, robust anomaly detection techniques, which have some theory behind them.

Since my working knowledge of mathematics is limited, I'm hoping to find a technique which is simple, such as using standard deviation. Hopefully the single-dimensioned nature of the data will make this quite a common problem, but if more information for the scenario is required please leave a comment and I will give more info.


Edit: thought I'd add more information about the data and what I've tried in case it makes one answer more correct than another.

The values are all positive and non-zero. I expect that the values will form a normal distribution. This expectation is based on an intuition of the domain rather than through analysis, if this is not a bad thing to assume, please let me know. In terms of clustering, unless there's also standard algorithms to choose a k-value, I would find it hard to provide this value to a k-Means algorithm.

The action I want to take for an outlier/anomaly is to present it to the user, and recommend that the data point is basically removed from the data set (I won't get in to how they would do that, but it makes sense for my domain), thus it will not be used as input to another function.

So far I have tried three-sigma, and the IQR outlier test on my limited data set. IQR flags values which are not extreme enough, three-sigma points out instances which better fit with my intuition of the domain.


Information on algorithms, techniques or links to resources to learn about this specific scenario are valid and welcome answers.

What is a recommended anomaly detection technique for simple, one-dimensional data?

Grundlefleck
  • 124,925
  • 25
  • 94
  • 111
  • Don't underestimate the value of scientific knowledge. Black box procedures are rarely the way to go. Try to express your scientific knowledge in terms of simple statistics. – Tristan Feb 20 '10 at 21:23
  • @Tristan: are you saying you think I should try to come up with a model which has some grounding in statistics, but ultimately is specific for my problem domain? – Grundlefleck Feb 21 '10 at 09:32
  • 1
    I'm just saying that your knowledge of what is reasonable (i.e., what is the model that generates the good data and bad data) is important information. You should design a procedure, such as using IQR, that is motivated by your scientific knowledge of the domain. I don't like things like k-means because it is not well motivated and is inherently inflexible, in my view. – Tristan Feb 21 '10 at 18:45

4 Answers4

49

Check out the three-sigma rule:

mu  = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std  THEN  x is outlier

An alternative method is the IQR outlier test:

Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25         // inter-quartile range
IF (x < Q25 - 1.5*IQR) OR (Q75 + 1.5*IQR < x) THEN  x is a mild outlier
IF (x < Q25 - 3.0*IQR) OR (Q75 + 3.0*IQR < x) THEN  x is an extreme outlier

this test is usually employed by Box plots (indicated by the whiskers):

boxplot


EDIT:

For your case (simple 1D univariate data), I think my first answer is well suited. That however isn't applicable to multivariate data.

@smaclell suggested using K-means to find the outliers. Beside the fact that it is mainly a clustering algorithm (not really an outlier detection technique), the problem with k-means is that it requires knowing in advance a good value for the number of clusters K.

A better suited technique is the DBSCAN: a density-based clustering algorithm. Basically it grows regions with sufficiently high density into clusters which will be maximal set of density-connected points.

dbscan_clustering

DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighbor points within distance epsilon of the starting point.

If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.

If the number of neighbors is less than minPoints, the point is marked as noise.

If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points until they are depleted.

Finally the set of all points marked as noise are considered outliers.

Community
  • 1
  • 1
Amro
  • 123,847
  • 25
  • 243
  • 454
  • 2
    +1 three-sigma and IQR look like good techniques, thanks for the insightful answer. – Grundlefleck Feb 21 '10 at 09:48
  • 2
    I like this simple advice. The IQR based statistic has the advantage of not being influenced by extreme outliers which will change the mean/sd. – Tristan Feb 21 '10 at 18:52
  • @Anony-Mousse: fixed, thanks. Funny enough I first learned about DBSCAN in a machine-learning class using Weka software/book – Amro Jul 13 '12 at 12:53
  • Yes, the Weka software and book are very widely used. Which is why it is a pity they made this error. Plus, the DBSCAN implementation in Weka is *really* crappy. It benchmarked way over 100x as slow as mine, and even slower as their OPTICS implementation? OPTICS should be quite a bit slower. – Has QUIT--Anony-Mousse Jul 13 '12 at 12:57
  • @Anony-Mousse: If you are willing and have the time, you could contribute your implementation to Weka. It is open sourced under GPL, and no I'm not affiliated with them in any way :) – Amro Jul 13 '12 at 13:05
  • I know too little on Weka, and everybody tells me it's too slow. Why start playing around with it more than a benchmark? – Has QUIT--Anony-Mousse Jul 13 '12 at 15:25
  • This answer saved my day. – Ashay Batwal Jul 08 '18 at 17:53
  • How does dbscan work in the univariate setup? Is it then boiled down to some KNN-algorithm? – CutePoison Jan 02 '19 at 08:07
  • @Jakob DBSCAN works the same way no matter the dimension size. Instead of the 2D points shown in the figure above, the instances would be 1D points all on a straight line. Now you can't really compare it to kNN (NN being a supervised classification algorithm as opposed to DBSCAN being an unsupervised clustering algorithm). – Amro Jan 02 '19 at 17:23
2

There are a variety of clustering techniques you could use to try to identify central tendencies within your data. One such algorithm we used heavily in my pattern recognition course was K-Means. This would allow you to identify whether there are more than one related sets of data, such as a bimodal distribution. This does require you having some knowledge of how many clusters to expect but is fairly efficient and easy to implement.

After you have the means you could then try to find out if any point is far from any of the means. You can define 'far' however you want but I would recommend the suggestions by @Amro as a good starting point.

For a more in-depth discussion of clustering algorithms refer to the wikipedia entry on clustering.

smaclell
  • 4,568
  • 7
  • 41
  • 49
  • Agreed. K-Means is a simple, effective, and adaptive solution for this problem. Create two clusters, initialize properly, and one of the clusters should contain the meaningful data while the other gets the outlier(s). But be careful; if you have no outliers, then both clusters will contain meaningful data. – Steve Tjoa Feb 20 '10 at 20:43
  • Well that is where it gets fun. It is often very difficult to determine the number of clusters and would be even harder doing it in a live system. Even in that case of one true cluster and another outlier cluster it could be argued the outliers are starting to represent a real mode for the data. I am going to add more links to provide other options. – smaclell Feb 20 '10 at 21:15
  • This strikes me as the wrong tool for the job. He's primarily interested in fat tails, not bimodal distributions. – Tristan Feb 20 '10 at 21:21
  • It depends on the asker's intent, so we cannot be completely sure. If the only intent is to assess how anomalous a data point is, then use simple statistics, of course. But if you want to, say, use the "good" data as an input to a subsequent function, then there may be value in classifying the points as "good" or "bad" (e.g., through K-means, etc.). – Steve Tjoa Feb 20 '10 at 22:09
  • But algorithms like k-means offer no ability for the user to define what anomalous means. The kmeans clusters are simply solutions to a very specific cluster definition. – Tristan Feb 20 '10 at 22:43
  • 3
    @Steve That is actually wrong. There is no reason why all the outliers should form a cluster. K-Means finds clusters for which the euclidean distance from its center is minimized - if the outliers are distributed evenly around the clusters, this will not help at all. The Euclidean distance results from a Gaussian assumption which is very vulnerable to outliers. Don't use K-Means for outlier detection only. You might want to use it for preprocessing and using three sigma afterwards, as stated by the original author. – bayer Feb 20 '10 at 23:05
  • @Tristan I am not sure if this is the wrong tool for the job but it is overkill for simple domains and having to pick a K still adds complexity. I just wanted to open the door for conventional clustering methods and how they could be applied to help determine outliers. Thanks you all for the discussion. – smaclell Feb 20 '10 at 23:32
  • I'm not saying K-means is unequivocally the right tool, but by no means is it unequivocally the wrong tool, and therefore I find the statement "that is actually wrong" a gross overgeneralization. The asker says "d is clearly an anomaly, and I would want to perform a specific action based on this." From this statement, I interpret a desire to assign hard labels: outlier, or not. First, three sigma is only an *empirical* choice! Is two sigma better? Or four? Neither of us know, because we don't know the data well enough. From the example given, I only see nonnegative data with ... – Steve Tjoa Feb 21 '10 at 01:36
  • ... the only outlier being >> 0. With that example, for most intelligent initializations of two cluster centers, K-means will correctly cluster (a,b,c,e) together; this can easily be verified by hand. Of course, if you also have nonnegative data, the outliers will not form a single cluster, I'm absolutely aware of that. But we don't know what the data is like. We don't know if it is uni- or multi-modal, how Gaussian it is or isn't, and its support. But I do know that there *exist* cases, even in one dimension, where K-means *will work*. – Steve Tjoa Feb 21 '10 at 01:41
1

This is an old topic but still it lacks some information.

Evidently, this can be seen as a case of univariate outlier detection. The approaches presented above have several pros and cons. Here are some weak spots:

  1. Detection of outliers with the mean and sigma has the obvious disadvantage of dependence of mean and sigma on the outliers themselves.
  2. The case of the small sample limit (see question for example) is not adequately covered by, 3 sigma, K-Means, IQR etc. And I could go on... However the statistical literature offers a simple metric: the median absolute deviation. (Medians are insensitive to outliers) Details can be found here: https://www.sciencedirect.com/book/9780128047330/introduction-to-robust-estimation-and-hypothesis-testing

I think this problem can be solved in a few lines of python code like this:

import numpy as np
import scipy.stats as sts

x = np.array([10, 14, 25, 467, 12]) # your values
np.abs(x - np.median(x))/(sts.median_abs_deviation(x)/0.6745) #MAD criterion

Subsequently you reject values above a certain threshold (97.5 percentile of the distribution of data), in case of an assumed normal distribution the threshold is 2.24. Here it translates to:

array([ 0.6745  ,  0.      ,  1.854875, 76.387125,  0.33725 ])

or the 467 entry being rejected.

Of course, one could argue, that the MAD (as presented) also assumes a normal dist. Therefore, why is it that argument 2 above (small sample) does not apply here? The answer is that MAD has a very high breakdown point. It is easy to choose different threshold points from different distributions and come to the same conclusion: 467 is the outlier.

Yanni
  • 11
  • 1
0

Both three-sigma rule and IQR test are often used, and there are a couple of simple algorithms to detect anomalies.

The three-sigma rule is correct
mu  = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std  THEN  x is outlier

The IQR test should be:

Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25         // inter-quartile range
If x >  Q75  + 1.5 * IQR or  x   < Q25 - 1.5 * IQR THEN  x is a mild outlier
If x >  Q75  + 3.0 * IQR or  x   < Q25 – 3.0 * IQR THEN  x is a extreme outlier
  • I just noticed this and you are right, my IQR test wasn't correct. I'll update my answer, thanks. – Amro Jan 22 '17 at 19:00