11

I have an algorithm that calculates the percentile(85) with Apache Commons of a series of values (12 values), for a later evaluation with a threshold to make a decision. The result is similar to the one given by Excel, but not equal, and sometimes this is critical for my application because with excel the result doesn't pass the threshold and with Apache Commons Math in Java it does, so I get different outputs.

Here it is an example: Internet traffic (Mbps) every 2 hours

32,7076813360000000 41,2580429776000000 45,4453940200000000 48,8044409456000000 46,7462847936000000 49,8028100056000000 54,3719451144000000 41,9708134600000000 29,4371963240000000 22,4667255616000000 20,0388452248000000 28,7807757104000000

After dividing by 1000 Mb (the capacity of the cable) I calculate the percentil(85) of the Occupation:

Excel: 0,049153870117

Apache Commons Math: 0.05003126676104001

I have found that it is possible to change the implementation of the percentile (it does not exist an official one) with setPercentileImpl(), but I couldn't find any example of how to do this, or the Excel algorithm (which is the one I was told to achieve).

Any help about this will be welcomed.

Thank you.

mram888
  • 4,899
  • 5
  • 33
  • 59
Jav_Rock
  • 22,059
  • 20
  • 123
  • 164
  • 1
    Can you give an examples of the discrepancy you see? With data we can try ourselves? – Peter Lawrey May 10 '11 at 09:49
  • Sure, I will edit my question with data. – Jav_Rock May 10 '11 at 11:32
  • 1
    Note that the reason for the differet results is that the set of values is 12. For larger values, both algorithms would give similar results. My problem is how to implement excel's method in Java. – Jav_Rock May 10 '11 at 11:39
  • 1
    Good question, when I use linear interpolation I get `0.049004` (when I divide your numbers by 1e19) I don't see how 0.05000 could be an answer as it is more than the 11th largest and it should be less IMHO. – Peter Lawrey May 10 '11 at 11:50
  • 1
    If I use a simplified 3rd order polynomial interpolation I get 0.491584 – Peter Lawrey May 10 '11 at 11:53
  • I found this link http://www.itl.nist.gov/div898/handbook/prc/section2/prc252.htm It is mentioned that Excel uses k+d= 1+p(N-1) instead of the most common method of k+d=p(N+1). If I find an implementation of the algorithm in Java it should be easy to find the part of the code where these variables are involved... – Jav_Rock May 10 '11 at 12:02
  • What can I write in google to find the code of a method? Are there any "key" words for that? – Jav_Rock May 10 '11 at 12:05

5 Answers5

10

The difference is subtle and due to assumptions. It is easiest to explain with the 3 element case. Suppose you have three elements(N=3) a=x[0] < b=x[1] < c=x[2]. Both the Apache and the Excel method say that element b is the 50th percentile (the median). However they differ for a and c.

The Apache method (and the method referenced by the NIST page) say a is the 25th percentile and c is the 75% percentile, because it divides the space up into N+1 blocks, that is, into quarters.

The Excel method says that a is the 0th percentile and c the 100th percentile, as the space is divided into N-1 blocks, that is, in half.

Because of this, if you want the Excel method and you don't want to code it yourself, you could just remove the smallest and the largest element from your array, and call the Apache method - it should give you exactly the same result except at percentiles beyond the end points.

If you want to code it yourself, an easy way is given below. Be aware of these issues:

  • this sorts the array (so changes it)
  • this takes O(N log(N)) time due to the sorting. The Apache method uses a fast selection algorithm so takes O(N) time (google "quickselect" if you want to know more)

Code (not tested or even compiled, but should give you an idea).

// warning - modifies data 
double excelPercentile(double [] data, double percentile) { array
    Arrays.sort(data);
    double index = percentile*(data.length-1);
    int lower = (int)Math.floor(index);
    if(lower<0) { // should never happen, but be defensive
       return data[0];
    }
    if(lower>=data.length-1) { // only in 100 percentile case, but be defensive
       return data[data.length-1);
    }
    double fraction = index-lower;
    // linear interpolation
    double result=data[lower] + fraction*(data[lower+1]-data[lower]);
    return result;
 }
gordon613
  • 2,770
  • 12
  • 52
  • 81
Nick Fortescue
  • 43,045
  • 26
  • 106
  • 134
  • Thanks. I think I founded the source code of Commons implementation http://www.java2s.com/Open-Source/Java-Document/Science/Apache-commons-math-1.1-src/org/apache/commons/math/stat/descriptive/rank/Percentile.java.htm I will try by changing the value asigned to pos in line 199. This should work, and I won't have to change the structure of my code. – Jav_Rock May 10 '11 at 13:09
  • 1
    That looks like it ought to work. I assume to double pos = 1+ p * (n - 1) / 100; – Nick Fortescue May 10 '11 at 13:27
  • 1
    I assume you will create a new class, rather than just replacing the Percentile class in your Apache jar. This could be dangerous if other classes from the Apache jar use this method, and also means you will have to keep it up to date with future releases – Nick Fortescue May 10 '11 at 13:29
  • 1
    Almost! It should be pos=(1+p*(n-1))/100. Dividing by 100 is part of Commons method, isn't it? So it should divide the whole expression. – Jav_Rock May 10 '11 at 13:33
  • Yes, I am creating a new class, and I will try to use the method setPercentileImpl(the new class) in the code, in order to use the new algorithm. As soon as I can check the results I will tell if I get the same value as in Excel. – Jav_Rock May 10 '11 at 13:35
  • No, you were right, I get the same result as Excel with pos=p(n-1)/100 + 1. – Jav_Rock May 10 '11 at 14:17
5

Class org.apache.commons.math3.stat.descriptive.rank.Percentile already supports Excel style interpolation, you just need to enable it with EstimationType.R_7

public class PercentileExcel extends Percentile {
    public PercentileExcel() throws MathIllegalArgumentException {

    super(50.0,
          EstimationType.R_7, // use excel style interpolation
          NaNStrategy.REMOVED,
          new KthSelector(new MedianOf3PivotingStrategy()));
    }
}
abarisone
  • 3,707
  • 11
  • 35
  • 54
pbirnie
  • 183
  • 3
  • 7
5

The solution was creating a class PercentileExcel which is almost a copy of percentile from commons method except for a small change on how to caculate the position:

pos=(1+p*(n-1))/100;

Then you need to add this line to the code in order to use the new class for percentile:

setPercentileImpl(PercentileExcel);
Jav_Rock
  • 22,059
  • 20
  • 123
  • 164
4

There is no unique definition of a percentile computed from a data set. See the Wikipedia page for the most common definitions in use.

Rob Hyndman
  • 30,301
  • 7
  • 73
  • 85
2

The following alternative with no new class needed works in 3.6:

DescriptiveStatistics ds = new DescriptiveStatistics();
Percentile p = new Percentile(50.0).withEstimationType(EstimationType.R_7)
                .withNaNStrategy(NaNStrategy.REMOVED)
                .withKthSelector(new KthSelector(new 
                  MedianOf3PivotingStrategy()))
ds.setPercentileImpl(p);
harmonica141
  • 1,389
  • 2
  • 23
  • 27
Chris94
  • 312
  • 1
  • 12