11

Imagine the following column calledd id:

68 69 43 54 56 61 69 70 71 72 77 78 79 85 87 88 89 93 95 96 98 99 99 62 66

If I do the following: percentile(id, 0.9), the output is 97.2. What is going on?

Mario Becerra
  • 514
  • 1
  • 6
  • 16
Pratik Garg
  • 747
  • 2
  • 9
  • 21
  • 2
    Also interesting to know how hive decentralizes this operation efficiently – Allen Wang Jan 15 '19 at 23:22
  • 1
    In addition to Andrea Romagnoli's answer I would like to mention that one of the common uses of percentile is finding the median value like so: percentile(id, 0.5) – Ilya Jan 25 '17 at 15:37

1 Answers1

13

If you put 0.9, you expect that the 90% of the data you give to the function will be under the returned value. 90% of 25 is approximately 22.5, and 97.2 can be a correct answer, because the four highest values are 99 99 98 96 in your set, and 97.2 is between the 22nd (96) and the 23rd (98) ordered numbers.

Andrea
  • 4,262
  • 4
  • 37
  • 56