Finding the 'Outliers' in numeric data set

Question

I want to compare (sorty by) growth rates and disadvantage high rates with very low starting values.

Example:

1. Start: 1.000.000 End: 1.100.000 Growth: +10%

Start: 100.000 End: 120.000 Growth: +20%

3. Start: 1 End: 10 Growth: +900%

Start: 10 End: 15 Growth: +50%

Sorting just by growth, descending would result in: 900% (3.), 50% (4.), 20% (2.), 10% (1.)

But I want to have: 20% (2.), 10% (1.), 900% (3.), 50% (4.), because in my case the chance is high, that 3. and 4. are statistical outliers.

What's the best way to solve this problem and do I've to define a threshold for the start values?

Thanks!

Interesting question. Can you explain why you think 3 and 4 are outliers while 1 and 2 are not? What's your underlying assumption about the models by which these systems grow? — templatetypedef, Jan 14 '15 at 22:40
You have to give more detail about the conceptual model before we can make assessments about the best approach. Generally speaking you can have a classifier do this for you if you have enough data (the more the better). If you want manual rule-based classification you have to come up with rules yourself. — Reut Sharabani, Jan 14 '15 at 22:44
The values are twitter followers and I want to determine which user had the most significant growth. The hypothesis is that it's more easy for new users to achieve high grow rates. — root66, Jan 15 '15 at 15:26

score 3 · Accepted Answer · edited Jun 05 '18 at 21:49

Based on the description you have provided, the problem can be split into 2:

Finding and excluding Statistical Outliers from the data set
Sorting the resulting values in descending (or just in any) order

The general solution to the first problem and example using Microsoft Excel is described at : Statistical Outliers detection in Microsoft Excel worksheet (http://www.codeproject.com/Tips/214330/Statistical-Outliers-detection). Following is a bit of theory and a sample pertinent to your case.

Finding "Outliers" in a data set could be done by calculating the deviation for each number, expressed as either a "Z-score" or "modified Z-score" and testing it against certain predefined threshold. Z-score typically refers to number of standard deviation relative to the statistical average (in other words, it's measured in "Sigmas"). Modified Z-score applies the median computation technique to measure the deviation and in many cases provides more robust statistical detection of outliers. Mathematically the Modified Z-score could be written (as suggested by Iglewicz and Hoaglin - see the referenced article) as:

Mi = 0.6745 * (Xi - Median(Xi)) / MAD,

where MAD stands for Median Absolute Deviation. Any number in a data set with the absolute value of modified Z-score exceeding 3.5 is considered an "Outlier". Modified Z-score could be used to detect outliers in Microsoft Excel worksheet pertinent to your case as described below.

Step 1. Open a Microsoft Excel worksheet and in Cells A1, A2, A3 and A4 enter the values: 900%, 50% 20% and 10%, correspondingly.

Step 2. In C1 enter the formula: =MEDIAN(A1:A4) . The value in this cell corresponds to the median calculated on a data set entered at step 1.

Step 3. In C2 enter the array formula: {=MEDIAN(ABS(MEDIAN(A1:A4)-A1:A4))} . As a reminder, in order to enter the array formula, select the cell, type the formula in Excel Formula Bar and then click on the combination: CTRL-SHIFT-ENTER (notice the curly brackets surrounding the expression, which indicates the array formula). The value in this cell (C2) corresponds to MAD.

Step 4. Enter the formula: =IF((0.6745*ABS(C$1-A1)>3.5*C$2), "OUTLIER", "NORMAL") in the first row of column B and extend it down to the 4th row. Final result of “Outlier’s detection” should appear in column B.

A       B           C
900%    OUTLIER     35%
50%     NORMAL      0.35
20%     NORMAL  
10%     NORMAL

thus the value 900% is found an "Outlier" while other values are OK. Sorting the result set will be just a trivial task.

Excel Worksheet example is included for the clarity of explanation. The algorithm itself could be implemented in any programming languages (VBA, C#, Java, etc). Hope this will help.

Looks good at the first sight, but the 50% could be also an "outlier", because my hypothesis is, that it's easier for data-entries with small starting values to achieve higher grow rates at the beginning. The values are twitter followers btw. Thanks! — root66, Jan 16 '15 at 20:09
@root66: The solution I provided above for finding stats 'Outliers' is based on NIST (USA) recommendation (see the reference article), thus widely accepted in scientific and engineering communities. You can set the 'Outlier' criteria pertinent to your particular use case. If you are satisfied with the answer, please mark it accepted. Thanks and regards, — Alexander Bell, Jan 17 '15 at 02:52
@AlexBell Why the 0.6745 value is not used in your solution like in the actual formula: Mi=0.6745 * (Xi -Median(Xi)) / MAD ? — Nawaf Alsulami, Oct 29 '15 at 14:27

score 3 · Answer 2 · answered Jun 13 '16 at 15:33

my solition

private static List<double> StatisticalOutLierAnalysis(List<double> allNumbers)
{
    List<double> normalNumbers = new List<double>();
    List<double> outLierNumbers = new List<double>(); 
    double avg = allNumbers.Average();
    double standardDeviation = Math.Sqrt(allNumbers.Average(v => Math.Pow(v - avg, 2)));
    foreach (double number in allNumbers)
    {
        if ((Math.Abs(number - avg)) > (2 * standardDeviation))
            outLierNumbers.Add(number);
        else
            normalNumbers.Add(number);
    }

    return normalNumbers;
}

Finding the 'Outliers' in numeric data set

2 Answers2