6

I'm trying to think of some code that will allow me to search through my ArrayList and detect any values outside the common range of "good values."

Example: 100 105 102 13 104 22 101

How would I be able to write the code to detect that (in this case) 13 and 22 don't fall within the "good values" of around 100?

Ashton
  • 119
  • 3
  • 4
  • 14
  • 8
    You're going to need to rigorously define what you mean by "good values". Are they values that are more than x-standard-deviations away from the average? Or something else? – Kon Sep 14 '13 at 18:43
  • 4
    This can be easily done with some `if` statements – user1231232141214124 Sep 14 '13 at 18:44
  • 1
    Read up on outlier detection: http://en.wikipedia.org/wiki/Outlier#Identifying_outliers – NPE Sep 14 '13 at 18:48
  • Could you explain how to do this with if statements @redFIVE. Thanks – Ashton Sep 15 '13 at 19:07
  • No. You need to learn how to do if statements – user1231232141214124 Sep 15 '13 at 19:08
  • 1
    @redFIVE I just wanted to make sure I was getting the right starting point. I understand that an if statement is a boolean comparison that executes the statements in its block if and only if the comparison passes (evaluates to true rather than false). However, thank you for your input. I thought about using if statements and just comparing inside a loop whether the difference between two values ever came out greater than 5 or less than -5. However, I ran into a problem determining how to detect which element is the one that should be removed. – Ashton Sep 15 '13 at 19:13

9 Answers

7

There are several criteria for detecting outliers. The simplest ones, like Chauvenet's criterion, use the mean and standard deviation calculated from the sample to determine a "normal" range for values. Any value outside of this range is deemed an outlier.

Other criteria, such as Grubbs' test and Dixon's Q test, may give better results than Chauvenet's, for example if the sample comes from a skewed distribution.
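To make Dixon's Q test a bit more concrete, here is a minimal sketch, not a reference implementation. It assumes a small sample of 3 to 10 values and uses the 90% confidence critical values quoted in a later answer in this thread; the class and method names (`DixonQSketch`, `findOutlier`) are made up for illustration. The statistic is the gap between the suspect value and its nearest neighbour, divided by the full range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class DixonQSketch {
    // Critical Q values at 90% confidence for N = 3..10 (index 0 is N = 3)
    private static final double[] Q_CRIT_90 =
            {0.941, 0.765, 0.642, 0.560, 0.507, 0.468, 0.437, 0.412};

    /** Returns the single suspected outlier, or null if neither end is rejected. */
    static Double findOutlier(List<Double> values) {
        int n = values.size();
        if (n < 3 || n > 10) {
            throw new IllegalArgumentException("this sketch only covers 3 to 10 values");
        }
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);

        double range = sorted.get(n - 1) - sorted.get(0);
        if (range == 0) {
            return null; // all values identical, nothing to reject
        }
        // Q = gap between the suspect value and its nearest neighbour, divided by the range
        double qLow = (sorted.get(1) - sorted.get(0)) / range;
        double qHigh = (sorted.get(n - 1) - sorted.get(n - 2)) / range;
        double qCrit = Q_CRIT_90[n - 3];

        // Test only the end with the larger gap
        if (qLow >= qHigh) {
            return qLow > qCrit ? sorted.get(0) : null;
        }
        return qHigh > qCrit ? sorted.get(n - 1) : null;
    }

    public static void main(String[] args) {
        List<Double> data = Arrays.asList(20.0, 65.0, 72.0, 75.0, 77.0, 78.0, 80.0, 81.0, 82.0, 83.0);
        System.out.println(findOutlier(data));
    }
}
```

The test is designed for one suspect value at a time. For the question's sample (100 105 102 13 104 22 101) this sketch returns null, because 13 and 22 sit close together and mask each other: the gap of 9 is small compared with the range of 92.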

Joni
  • 108,737
  • 14
  • 143
  • 193
  • I'm not sure if I'm calculating the standard deviation wrong. In my JUnit, I had 10, 12, 11, 25, 13, 14 as my array. I calculated the standard deviation as being 5.----. I'm not certain how to interpret this answer to use in my data as a factor. – Ashton Sep 15 '13 at 19:06
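One possible way to read a standard deviation of about 5 for that sample is to express each value's distance from the mean in units of the standard deviation (often called a z-score) and compare it with a cut-off you choose. The sketch below does this for 10, 12, 11, 25, 13, 14, assuming the sample standard deviation (n - 1 in the denominator); the class and method names are illustrative only.

```java
import java.util.Arrays;
import java.util.List;

class ZScoreSketch {
    static double mean(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    // Sample standard deviation: divides by n - 1
    static double sampleStdDev(List<Double> values) {
        double m = mean(values);
        double squares = 0;
        for (double v : values) {
            squares += (v - m) * (v - m);
        }
        return Math.sqrt(squares / (values.size() - 1));
    }

    // How many standard deviations x lies above (+) or below (-) the mean
    static double zScore(double x, double mean, double stdDev) {
        return (x - mean) / stdDev;
    }

    public static void main(String[] args) {
        List<Double> data = Arrays.asList(10.0, 12.0, 11.0, 25.0, 13.0, 14.0);
        double m = mean(data);
        double sd = sampleStdDev(data);
        for (double v : data) {
            System.out.printf("%.1f -> z = %.2f%n", v, zScore(v, m, sd));
        }
    }
}
```

Here the mean is about 14.17 and the sample standard deviation about 5.49, so 25 lies roughly 1.97 standard deviations above the mean while every other value is within 1. A cut-off of 1.5 would flag 25; a cut-off of 2 would just miss it, which is why the factor has to be chosen deliberately.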
7
package test;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        List<Double> data = new ArrayList<Double>();
        data.add((double) 20);
        data.add((double) 65);
        data.add((double) 72);
        data.add((double) 75);
        data.add((double) 77);
        data.add((double) 78);
        data.add((double) 80);
        data.add((double) 81);
        data.add((double) 82);
        data.add((double) 83);
        Collections.sort(data);
        System.out.println(getOutliers(data));
    }

    public static List<Double> getOutliers(List<Double> input) {
        List<Double> output = new ArrayList<Double>();
        // Quartiles are undefined for fewer than two values
        if (input == null || input.size() < 2) {
            return output;
        }
        // Work on a sorted copy so the caller does not have to pre-sort
        List<Double> sorted = new ArrayList<Double>(input);
        Collections.sort(sorted);
        // Lower and upper halves; the middle value is left out when the size is odd
        int half = sorted.size() / 2;
        List<Double> data1 = sorted.subList(0, half);
        List<Double> data2 = sorted.subList(sorted.size() - half, sorted.size());
        double q1 = getMedian(data1);
        double q3 = getMedian(data2);
        double iqr = q3 - q1;
        double lowerFence = q1 - 1.5 * iqr;
        double upperFence = q3 + 1.5 * iqr;
        for (int i = 0; i < input.size(); i++) {
            if (input.get(i) < lowerFence || input.get(i) > upperFence)
                output.add(input.get(i));
        }
        return output;
    }

    private static double getMedian(List<Double> data) {
        if (data.size() % 2 == 0)
            return (data.get(data.size() / 2) + data.get(data.size() / 2 - 1)) / 2;
        else
            return data.get(data.size() / 2);
    }
}

Output: [20.0]

Explanation:

  • Sort the list of values from low to high
  • Split the sorted list into 2 halves around the middle (leaving out the middle value when the count is odd) and put them into 2 new separate lists (call them "left" and "right")
  • Find the middle number (median) of each of those new lists
  • Q1 is the median of the left half, and Q3 is the median of the right half
  • Applying mathematical formula:
  • IQR = Q3 - Q1
  • LowerFence = Q1 - 1.5*IQR
  • UpperFence = Q3 + 1.5*IQR
  • More info about this formula: http://www.mathwords.com/o/outlier.htm
  • Loop through all of my original elements, and if any of them are lower than a lower fence, or higher than an upper fence, add them to "output" ArrayList
  • This new "output" ArrayList contains the outliers
sklimkovitch
  • 251
  • 4
  • 8
  • this code is seriously bad. It assumes input is sorted. getMedian has a bug if data is null or data.getSize() == 1 – Mladen Adamovic Jul 16 '18 at 12:50
  • 7
    @MladenAdamovic: in general, code from Stackoverflow should be seen more as a guidance to other people than "production code, ready to be copied/pasted", at least, that's what professional engineers do. It is always easier to criticize based on edge cases than writing a full algorithm like sklimkovitch did. Like the popular song says: "be humble" ;-) – Clint Eastwood Aug 28 '18 at 17:26
4

An implementation of Grubbs' test can be found at MathUtil.java. It finds a single outlier, which you can remove from your list; repeat until no more outliers are found.

Depends on commons-math, so if you're using Gradle:

dependencies {
  compile 'org.apache.commons:commons-math:2.2'
}
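Since the linked MathUtil.java is not reproduced here, the following is only a rough sketch of the remove-and-repeat idea, not that implementation. It computes the Grubbs statistic G = max|x - mean| / s (with s the sample standard deviation) and compares it with a critical value. The critical value depends on the sample size and significance level and normally comes from a table or a t-distribution, so it is supplied by the caller through a hypothetical `criticalForN` function; all names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntToDoubleFunction;

class GrubbsSketch {
    static double mean(List<Double> values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.size();
    }

    // Sample standard deviation (n - 1 in the denominator)
    static double sampleStdDev(List<Double> values, double mean) {
        double squares = 0;
        for (double v : values) squares += (v - mean) * (v - mean);
        return Math.sqrt(squares / (values.size() - 1));
    }

    // Index of the value farthest from the mean
    static int mostExtremeIndex(List<Double> values, double mean) {
        int index = 0;
        for (int i = 1; i < values.size(); i++) {
            if (Math.abs(values.get(i) - mean) > Math.abs(values.get(index) - mean)) {
                index = i;
            }
        }
        return index;
    }

    // G = max |x - mean| / s
    static double grubbsStatistic(List<Double> values) {
        double m = mean(values);
        int idx = mostExtremeIndex(values, m);
        return Math.abs(values.get(idx) - m) / sampleStdDev(values, m);
    }

    // Removes one outlier at a time until G no longer exceeds the critical value
    static List<Double> removeOutliers(List<Double> values, IntToDoubleFunction criticalForN) {
        List<Double> remaining = new ArrayList<>(values);
        while (remaining.size() >= 3) {
            double m = mean(remaining);
            double s = sampleStdDev(remaining, m);
            if (s == 0) break; // all values identical
            int idx = mostExtremeIndex(remaining, m);
            double g = Math.abs(remaining.get(idx) - m) / s;
            if (g <= criticalForN.applyAsDouble(remaining.size())) break;
            remaining.remove(idx); // int argument, so this removes by index
        }
        return remaining;
    }

    public static void main(String[] args) {
        List<Double> data = Arrays.asList(20.0, 65.0, 72.0, 75.0, 77.0, 78.0, 80.0, 81.0, 82.0, 83.0);
        // 2.3 is a placeholder threshold for illustration, not a looked-up critical value
        System.out.println(removeOutliers(data, n -> 2.3));
    }
}
```

In real use, `criticalForN` would look up or compute the proper critical value for each sample size, for example with the t-distribution support in commons-math.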
Travis
  • 1,926
  • 1
  • 19
  • 26
1
  • Find the mean value of your list
  • Create a Map that maps each number to its distance from the mean
  • Sort the values by their distance from the mean
  • Single out the last n numbers, making sure the cut-off treats values at similar distances consistently
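The steps above could be sketched roughly as follows. This version skips the intermediate Map and sorts a copy of the list directly by distance from the mean; the names `DistanceRankSketch` and `sortByDistanceFromMean` are invented for the example, and how many trailing values to treat as outliers is left to the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class DistanceRankSketch {
    /** Returns a copy of the values ordered from closest to farthest from the mean. */
    static List<Integer> sortByDistanceFromMean(List<Integer> values) {
        double sum = 0;
        for (int v : values) {
            sum += v;
        }
        double mean = sum / values.size();

        List<Integer> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.comparingDouble((Integer v) -> Math.abs(v - mean)));
        return sorted;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(100, 105, 102, 13, 104, 22, 101);
        List<Integer> ranked = sortByDistanceFromMean(data);
        // The candidates for removal are at the end of the ranked list
        System.out.println(ranked);
    }
}
```

For the question's sample the mean is about 78.1, so the ranking ends with 22 and 13. Note that the outliers themselves pull the mean down, which is why the good values near 100 still show distances of more than 20.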
jmj
  • 237,923
  • 42
  • 401
  • 438
1

Use this algorithm. It uses the average and the standard deviation. The multiplier applied to the standard deviation (2 in `2 * standardDeviation`) is an optional value that you can tune.

 public static List<int> StatisticalOutLierAnalysis(List<int> allNumbers)
            {
                if (allNumbers.Count == 0)
                    return null;

                List<int> normalNumbers = new List<int>();
                List<int> outLierNumbers = new List<int>();
                double avg = allNumbers.Average();
                double standardDeviation = Math.Sqrt(allNumbers.Average(v => Math.Pow(v - avg, 2)));
                foreach (int number in allNumbers)
                {
                    if ((Math.Abs(number - avg)) > (2 * standardDeviation))
                        outLierNumbers.Add(number);
                    else
                        normalNumbers.Add(number);
                }

                return normalNumbers;
            }
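Since the question is about a Java ArrayList and the snippet above is C#, here is a rough Java equivalent, offered as a sketch rather than an exact port. Like the original it uses the population standard deviation (dividing by n) and a fixed factor of 2, but it returns an empty list instead of null for empty input; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TwoSigmaSketch {
    /** Returns the values lying within 2 standard deviations of the average. */
    static List<Integer> withoutOutliers(List<Integer> allNumbers) {
        List<Integer> normalNumbers = new ArrayList<>();
        if (allNumbers.isEmpty()) {
            return normalNumbers;
        }
        double sum = 0;
        for (int n : allNumbers) {
            sum += n;
        }
        double avg = sum / allNumbers.size();

        double squares = 0;
        for (int n : allNumbers) {
            squares += (n - avg) * (n - avg);
        }
        // Population standard deviation, matching the snippet above
        double standardDeviation = Math.sqrt(squares / allNumbers.size());

        for (int n : allNumbers) {
            if (Math.abs(n - avg) <= 2 * standardDeviation) {
                normalNumbers.add(n);
            }
        }
        return normalNumbers;
    }

    public static void main(String[] args) {
        System.out.println(withoutOutliers(Arrays.asList(20, 65, 72, 75, 77, 78, 80, 81, 82, 83)));
        System.out.println(withoutOutliers(Arrays.asList(100, 105, 102, 13, 104, 22, 101)));
    }
}
```

With a factor of 2 the first sample loses 20, but the question's sample comes back unchanged: 13 and 22 inflate the standard deviation to about 38.5, so neither is more than 2 standard deviations from the average. A smaller factor or a more robust criterion is needed for that data.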
mesutpiskin
  • 1,771
  • 2
  • 26
  • 30
1

As Joni already pointed out, you can eliminate outliers with the help of the standard deviation and the mean. Here is my code, which you can use for your purposes.

public static void main(String[] args) {

    List<Integer> values = new ArrayList<>();
    values.add(100);
    values.add(105);
    values.add(102);
    values.add(13);
    values.add(104);
    values.add(22);
    values.add(101);

    System.out.println("Before: " + values);
    System.out.println("After: " + eliminateOutliers(values, 1.5f));

}

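protected static double getMean(List<Integer> values) {
    // Accumulate in a double so the division below is not truncated to an int
    double sum = 0;
    for (int value : values) {
        sum += value;
    }

    return sum / values.size();
}

public static double getVariance(List<Integer> values) {
    double mean = getMean(values);
    // A double accumulator keeps the fractional part of each squared deviation
    double temp = 0;

    for (int a : values) {
        temp += (a - mean) * (a - mean);
    }

    // Sample variance (n - 1 in the denominator)
    return temp / (values.size() - 1);
}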

public static double getStdDev(List<Integer> values) {
    return Math.sqrt(getVariance(values));
}

public static List<Integer> eliminateOutliers(List<Integer> values, float scaleOfElimination) {
    double mean = getMean(values);
    double stdDev = getStdDev(values);

    final List<Integer> newList = new ArrayList<>();

    for (int value : values) {
        boolean isLessThanLowerBound = value < mean - stdDev * scaleOfElimination;
        boolean isGreaterThanUpperBound = value > mean + stdDev * scaleOfElimination;
        boolean isOutOfBounds = isLessThanLowerBound || isGreaterThanUpperBound;

        if (!isOutOfBounds) {
            newList.add(value);
        }
    }

    int countOfOutliers = values.size() - newList.size();
    if (countOfOutliers == 0) {
        return values;
    }

    return eliminateOutliers(newList,scaleOfElimination);
}
  • The eliminateOutliers() method does all the work
  • It is a recursive method: each call builds a new list without the values that are out of bounds and recurses on it until no more outliers are found
  • The scaleOfElimination argument defines how aggressively outliers are removed. Normally I go with 1.5f-2f; the greater the value, the fewer outliers will be removed

The output of the code:

Before: [100, 105, 102, 13, 104, 22, 101]

After: [100, 105, 102, 104, 101]

Valiyev
  • 21
  • 1
  • 8
0

Many thanks to Valiyev; his solution helped me a lot. I want to share my small SRP refactoring of his work.

Please note that I use List.of() to store Dixon's critical values, so Java 9 or later is required.

import java.util.List;
import java.util.stream.Collectors;

public class DixonTest {
protected List<Double> criticalValues = 
    List.of(0.941, 0.765, 0.642, 0.56, 0.507, 0.468, 0.437);
private double scaleOfElimination;
private double mean;
private double stdDev;

private double getMean(final List<Double> input) {
    double sum = input.stream()
            .mapToDouble(value -> value)
            .sum();
    return (sum / input.size());
}

  private double getVariance(List<Double> input) {
    double mean = getMean(input);
    double temp = input.stream()
            .mapToDouble(a -> a)
            .map(a -> (a - mean) * (a - mean))
            .sum();
    return temp / (input.size() - 1);
}

private double getStdDev(List<Double> input) {
    return Math.sqrt(getVariance(input));
}

protected List<Double> eliminateOutliers(List<Double> input) {
    int N = input.size() - 3;
    scaleOfElimination = criticalValues.get(N).floatValue();
    mean = getMean(input);
    stdDev = getStdDev(input);

    return input.stream()
            .filter(this::isOutOfBounds)
            .collect(Collectors.toList());
}

private boolean isOutOfBounds(Double value) {
    return !(isLessThanLowerBound(value)
            || isGreaterThanUpperBound(value));
}

private boolean isGreaterThanUpperBound(Double value) {
    return value > mean + stdDev * scaleOfElimination;
}

private boolean isLessThanLowerBound(Double value) {
    return value < mean - stdDev * scaleOfElimination;
}
}

I hope it will help someone else.

Best regards

0

Thanks to @Emil_Wozniak for posting the complete code. I struggled with it for a while not realizing that eliminateOutliers() actually returns the outliers, not the list with them eliminated. The isOutOfBounds() method also was confusing because it actually returns TRUE when the value is IN bounds. Below is my update with some (IMHO) improvements:

  • The eliminateOutliers() method returns the input list with outliers removed
  • Added getOutliers() method to get just the list of outliers
  • Removed confusing isOutOfBounds() method in favor of a simple filtering expression
  • Expanded N list to support up to 30 input values
  • Protect against out of bounds errors when input list is too big or too small
  • Made stats methods (mean, stddev, variance) static utility methods
  • Calculate upper/lower bounds only once instead of on every comparison
  • Supply input list on ctor and store as an instance variable
  • Refactor to avoid using the same variable name as instance and local variables

Code:

/**
 * Implements an outlier removal algorithm based on https://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/dixon.htm#:~:text=It%20can%20be%20used%20to,but%20one%20or%20two%20observations).
 * Original Java code by Emil Wozniak at https://stackoverflow.com/questions/18805178/how-to-detect-outliers-in-an-arraylist
 * 
 * Reorganized, made more robust, and clarified many of the methods.
 */

import java.util.List;
import java.util.stream.Collectors;

public class DixonTest {
    protected List<Double> criticalValues = 
            List.of( // Taken from https://sebastianraschka.com/Articles/2014_dixon_test.html#2-calculate-q
                    // Alpha level of 0.1 (90% confidence)
                    0.941,  // N=3
                    0.765,  // N=4
                    0.642,  // ...
                    0.56,
                    0.507,
                    0.468,
                    0.437,
                    0.412,
                    0.392,
                    0.376,
                    0.361,
                    0.349,
                    0.338,
                    0.329,
                    0.32,
                    0.313,
                    0.306,
                    0.3,
                    0.295,
                    0.29,
                    0.285,
                    0.281,
                    0.277,
                    0.273,
                    0.269,
                    0.266,
                    0.263,
                    0.26     // N=30
                    );
    
    // Stats calculated on original input data (including outliers)
    private double scaleOfElimination;
    private double mean;
    private double stdDev;
    private double UB;
    private double LB;
    private List<Double> input;
    
    /**
     * Ctor taking a list of values to be analyzed. 
     * @param input
     */
    public DixonTest(List<Double> input) {
        this.input = input;
        
        // Create statistics on the original input data
        calcStats();
    }

    /**
     * Utility method returns the mean of a list of values.
     * @param valueList
     * @return
     */
    public static double getMean(final List<Double> valueList) {
        double sum = valueList.stream()
                .mapToDouble(value -> value)
                .sum();
        return (sum / valueList.size());
    }

    /**
     * Utility method returns the variance of a list of values.
     * @param valueList
     * @return
     */
    public static double getVariance(List<Double> valueList) {
        double listMean = getMean(valueList);
        double temp = valueList.stream()
                .mapToDouble(a -> a)
                .map(a -> (a - listMean) * (a - listMean))
                .sum();
        return temp / (valueList.size() - 1);
    }

    /**
     * Utility method returns the std deviation of a list of values.
     * @param valueList
     * @return
     */
    public static double getStdDev(List<Double> valueList) {
        return Math.sqrt(getVariance(valueList));
    }
    
    /**
     * Calculate statistics and bounds from the input values and store
     * them in class variables.
     */
    private void calcStats() {
        int N = Math.min(Math.max(0, input.size() - 3), criticalValues.size()-1); // Changed to protect against too-small or too-large lists
        scaleOfElimination = criticalValues.get(N); // the field is a double, so no float conversion is needed
        mean = getMean(input);
        stdDev = getStdDev(input);
        UB = mean + stdDev * scaleOfElimination;
        LB = mean - stdDev * scaleOfElimination;        
    }

    /**
     * Returns the input values with outliers removed.
     * @return the values lying within [LB, UB]
     */
    public List<Double> eliminateOutliers() {

        return input.stream()
                .filter(value -> value>=LB && value <=UB)
                .collect(Collectors.toList());
    }

    /**
     * Returns the outliers found in the input list.
     * @return the values lying outside [LB, UB]
     */
    public List<Double> getOutliers() {

        return input.stream()
                .filter(value -> value<LB || value>UB)
                .collect(Collectors.toList());
    }

    /**
     * Test and sample usage
     * @param args
     */
    public static void main(String[] args) {
        List<Double> testValues = List.of(1200.0,1205.0,1220.0,1194.0,1212.0);
        
        DixonTest outlierDetector = new DixonTest(testValues);
        List<Double> goodValues = outlierDetector.eliminateOutliers();
        List<Double> badValues = outlierDetector.getOutliers();
        
        System.out.println(goodValues.size()+ " good values:");
        for (double v: goodValues) {
            System.out.println(v);
        }
        System.out.println(badValues.size()+" outliers detected:");
        for (double v: badValues) {
            System.out.println(v);
        }
        
        // Get stats on remaining (good) values
        System.out.println("\nMean of good values is "+DixonTest.getMean(goodValues));
    }
}
user3191192
  • 169
  • 13
-1

This is a very simple implementation that collects the numbers that are not in the range:

List<Integer> notInRangeNumbers = new ArrayList<Integer>();
for (Integer number : numbers) {
    // call with a predefined factor value, here example value = 5
    if (!isInRange(number, 5)) {
        notInRangeNumbers.add(number);
    }
}

Additionally, inside the isInRange method you have to define what you mean by 'good values'. Below is an example implementation.

private boolean isInRange(Integer number, int aroundFactor) {
   //TODO the implementation of the 'in range condition'
   // here the example implementation
   return number <= 100 + aroundFactor && number >= 100 - aroundFactor;
}
Łukasz Rzeszotarski
  • 5,791
  • 6
  • 37
  • 68
  • I really like your ideas, but I cannot use this in my program, specifically. The data set could be any set of numbers, but most will be around some value. Not knowing that value, is it still possible to do your method(s)? Thanks. – Ashton Sep 15 '13 at 19:04
  • @Dan What do you mean that the numbers are around some value, but don't know that value. I guess that the value has to be somehow hardcoded/ predefined. Can you please extend your question and describe what you really want to achieve, because as I see the comments it is not fully clear. – Łukasz Rzeszotarski Sep 15 '13 at 19:13
  • Sorry for not being clear. I just want to find a "ranged average," checking the data set from input first for outliers or anomalies, removing them from the arrayList, then calculating the average. – Ashton Sep 15 '13 at 19:16
  • @Dan Ok so it seems that you have to implement some criteria proposed by Joni. of course you can adapt my code to check if a number is an outlier however now it's clear where is the point. See https://gist.github.com/sushain97/6488296 there is some example of Chauvenet's Criterion for Outliers – Łukasz Rzeszotarski Sep 15 '13 at 19:27
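For the case raised in these comments, where the centre value is not known in advance and the goal is an average with anomalies removed, one option is to estimate the centre from the data with the median and measure spread with the median absolute deviation (MAD), since neither is pulled around much by a few extreme values. This is a swapped-in technique, not the hardcoded check above or Chauvenet's criterion. The sketch below assumes a non-empty list; the names and the `cutoff` parameter are hypothetical, and the cut-off of 5 used in `main` is only an example value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class MadOutlierSketch {
    static double median(List<Double> values) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        return n % 2 == 1
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    /** Keeps values whose distance from the median is at most cutoff * MAD. */
    static List<Double> withoutOutliers(List<Double> values, double cutoff) {
        double med = median(values);
        List<Double> absoluteDeviations = new ArrayList<>();
        for (double v : values) {
            absoluteDeviations.add(Math.abs(v - med));
        }
        double mad = median(absoluteDeviations);

        List<Double> kept = new ArrayList<>();
        for (double v : values) {
            if (Math.abs(v - med) <= cutoff * mad) {
                kept.add(v);
            }
        }
        return kept;
    }

    static double average(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    public static void main(String[] args) {
        List<Double> data = Arrays.asList(100.0, 105.0, 102.0, 13.0, 104.0, 22.0, 101.0);
        List<Double> good = withoutOutliers(data, 5.0);
        System.out.println(good + " average = " + average(good));
    }
}
```

For the question's sample the median is 101 and the MAD is 3, so with a cut-off of 5 anything more than 15 away from 101 is dropped; that removes 13 and 22 and leaves an average of 102.4. One caveat: if more than half of the values are identical, the MAD is 0 and every value different from the median is treated as an outlier.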