I am not 100% sure whether this is a bug or I am doing something wrong, but if you give Percentile a large amount of data consisting of the same value (see code below), the evaluate method takes a very long time. If you give Percentile random values, evaluate takes considerably less time.

As shown in the code below, Median is a subclass of Percentile.

Percentile java doc

private void testOne(){
  int size = 200000;
  int sameValue = 100;
  List<Double> list = new ArrayList<Double>();

  for (int i = 0; i < size; i++)
  {
    list.add((double)sameValue);
  }
  Median m = new Median();
  m.setData(ArrayUtils.toPrimitive(list.toArray(new Double[0])));

  long start = System.currentTimeMillis();
  System.out.println("Start:"+ start);

  double result = m.evaluate();

  System.out.println("Result:" + result);
  System.out.println("Time:"+ (System.currentTimeMillis()- start));
}


private void testTwo(){
  int size = 200000;
  List<Double> list = new ArrayList<Double>();

  Random r = new Random();

  for (int i = 0; i < size; i++)
  {
    list.add(r.nextDouble() * 100.0);
  }
  Median m = new Median();
  m.setData(ArrayUtils.toPrimitive(list.toArray(new Double[0])));

  long start = System.currentTimeMillis();
  System.out.println("Start:"+ start);

  double result = m.evaluate();

  System.out.println("Result:" + result);
  System.out.println("Time:"+ (System.currentTimeMillis()- start));
}
Dimitry

2 Answers

This is a known issue that was introduced between versions 2.0 and 2.1 and has been fixed for version 3.1.

Version 2.0 did indeed sort the data, but in 2.1 they seem to have switched to a selection algorithm. A bug in that implementation caused poor performance on data with many identical values: basically, they used >= and <= instead of > and <.
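
To illustrate why that choice of comparison matters, here is a minimal sketch of a quickselect with a Hoare-style partition. It is not the Commons Math source; the class SelectSketch, its methods, and the skipEquals flag are hypothetical names made up for this example. With skipEquals set to true the scans run past elements equal to the pivot (>= and <=), so on all-identical data each partition strips off only one element and the total work grows quadratically. With skipEquals set to false the scans stop on equal elements (> and <), and the partition lands near the middle.

```java
import java.util.Arrays;

public class SelectSketch {
    // Number of element comparisons made by the most recent select() call.
    static long comparisons;

    // Returns the k-th smallest element (0-based) of a copy of data.
    // skipEquals = true:  scans skip elements equal to the pivot (<= / >=).
    // skipEquals = false: scans stop on elements equal to the pivot (< / >).
    static double select(double[] data, int k, boolean skipEquals) {
        double[] a = data.clone();
        comparisons = 0;
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi, skipEquals);
            if (k == p) return a[k];
            if (k < p) hi = p - 1; else lo = p + 1;
        }
        return a[k];
    }

    // Hoare-style partition around a[lo]; returns the pivot's final index.
    private static int partition(double[] a, int lo, int hi, boolean skipEquals) {
        double v = a[lo];
        int i = lo, j = hi + 1;
        while (true) {
            while (before(a[++i], v, skipEquals)) { if (i == hi) break; }
            while (after(a[--j], v, skipEquals)) { if (j == lo) break; }
            if (i >= j) break;
            swap(a, i, j);
        }
        swap(a, lo, j);
        return j;
    }

    // True while the left-to-right scan should keep moving.
    private static boolean before(double x, double pivot, boolean skipEquals) {
        comparisons++;
        return skipEquals ? x <= pivot : x < pivot;
    }

    // True while the right-to-left scan should keep moving.
    private static boolean after(double x, double pivot, boolean skipEquals) {
        comparisons++;
        return skipEquals ? x >= pivot : x > pivot;
    }

    private static void swap(double[] a, int i, int j) {
        double t = a[i]; a[i] = a[j]; a[j] = t;
    }

    public static void main(String[] args) {
        double[] same = new double[20000];
        Arrays.fill(same, 100.0);          // all values identical, as in testOne
        int k = same.length / 2;

        select(same, k, true);
        System.out.println("skipping equal elements:    " + comparisons + " comparisons");
        select(same, k, false);
        System.out.println("stopping on equal elements: " + comparisons + " comparisons");
    }
}
```

Running main prints the two comparison counts for an array of identical values; the first should be far larger than the second, which matches the slowdown seen in testOne.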

Michael McGowan

It's well known that some algorithms can exhibit slower performance for certain data sets. Performance can actually be improved by randomizing the data set before performing the operation.

Since percentile probably involves sorting the data, I'm guessing that your "bug" is not really a defect in the code, but rather the manifestation of one of the slower performing data sets.

duffymo
  • Second that. Given that Percentile uses a sorted array, it may be a manifestation of an edge case in the sorting algorithm. Classic quick sort, for instance, takes an enormous amount of time to sort an already sorted array. – Vladimir Dyuzhev Apr 03 '11 at 20:26
  • The Javadoc for Percentile says that Arrays.sort(double[]) is used. If you run Arrays.sort(double[]) on the same data, it returns rather quickly. – Dimitry Apr 04 '11 at 03:47
  • While a good guess, this is indeed a bug. See [my answer](http://stackoverflow.com/a/13650749/387852). – Michael McGowan Nov 30 '12 at 18:11