Suppose you have a very large list of numbers that would be expensive to sort. They are real numbers (decimals), but all lie in the same range, say 0 to n for some integer n. Are there any methods for estimating percentiles that don't require sorting the data, i.e. an algorithm with better complexity than the fastest sorting algorithm?

Note: The tag is quantiles only because there is no existing tag for percentiles and it wouldn't let me create one; my question is not specific to quantiles.

tshepang
pwerth
  • Quantiles is less specific than percentiles. – Peter Flom Jun 25 '14 at 17:53
  • a quantile is a specific percentile – pwerth Jun 25 '14 at 17:56
  • No, it isn't. A percentile is a specific quantile. Quantiles could be percentiles or they could be smaller than percentiles (but we don't have common words for them, although 1 in 1000 is a permille). E.g. if you wanted a value that was bigger than all but 1 in a million, that would be a quantile but not an exact percentile. [Quantiles](http://en.wikipedia.org/wiki/Quantile). Percentiles are specific quantiles. – Peter Flom Jun 25 '14 at 18:01
  • whooops i assumed quantile = 20,40,60,80,100 percentile like how quartile = 25,50,75,100. if it's not obvious at this point i do not have a background in statistics... sorry about that! – pwerth Jun 25 '14 at 18:34
  • There actually _is_ a name for when we divide the data in five groups: they're called _quintiles_. Just one letter different from "quantiles". – David K Jun 25 '14 at 18:41
  • Terminology only: I'm happy with the idea of e.g. the 2.5% point as a percentile. I don't see why the percents must be integers, although using integers is certainly a very common convention. I agree that quantile is the general term here. – Nick Cox Jun 25 '14 at 19:01

1 Answer

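In order to find the p-th percentile of a set of N numbers, you are essentially trying to find the k-th smallest number, where k = N*p/100 rounded up (thinking of the median of an odd number of values, for example, rounding down would pick the wrong element). This is the nearest-rank convention; other definitions of a percentile interpolate between neighboring values.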
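To make the rank calculation concrete, here is a minimal sketch assuming the nearest-rank convention. The function name `nearest_rank` is made up for illustration, and other percentile definitions would give slightly different ranks.

```python
import math


def nearest_rank(n, p):
    """Return the 1-based rank k of the p-th percentile among n values.

    Assumes the nearest-rank convention: k = ceil(n * p / 100).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    return math.ceil(n * p / 100)
```

For example, with 5 values the median (p = 50) has rank 3, the middle element.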

You might try the median of medians algorithm, which finds the k-th smallest (or, symmetrically, the k-th largest) number among N numbers in O(N) worst-case time, without sorting the whole list. I don't know where this is implemented in a standard library but a proposed implementation was posted in one of the answers to this question.
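As a rough illustration of the idea (not the implementation from the linked answer), the following sketch does selection with a median-of-medians pivot. The names `select` and `percentile` are hypothetical, and the rank uses the nearest-rank convention as an assumption. It copies the data into new lists for clarity, so it favors readability over memory use.

```python
import math


def median_of_five(group):
    """Median of a small group (at most five items), found by sorting it."""
    return sorted(group)[len(group) // 2]


def select(values, k):
    """Return the k-th smallest item (k is 1-based) without sorting everything."""
    items = list(values)
    if not 1 <= k <= len(items):
        raise ValueError("k must be between 1 and len(values)")
    while len(items) > 5:
        # Pivot: the median of the medians of groups of five.
        medians = [median_of_five(items[i:i + 5])
                   for i in range(0, len(items), 5)]
        pivot = select(medians, (len(medians) + 1) // 2)
        lower = [x for x in items if x < pivot]
        equal_count = sum(1 for x in items if x == pivot)
        if k <= len(lower):
            items = lower                      # answer is below the pivot
        elif k <= len(lower) + equal_count:
            return pivot                       # answer is the pivot itself
        else:
            k -= len(lower) + equal_count      # answer is above the pivot
            items = [x for x in items if x > pivot]
    return sorted(items)[k - 1]


def percentile(values, p):
    """p-th percentile under the nearest-rank convention (an assumption)."""
    items = list(values)
    k = max(1, math.ceil(len(items) * p / 100))
    return select(items, k)
```

Each pass discards a constant fraction of the items, which is where the linear worst-case bound comes from. In practice a randomized pivot is usually faster on average, at the cost of a quadratic worst case.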

Community
David K