
I'm trying to use the percentile function in Spark SQL.

Data:

col1
----
198
15.8
198
198
198
198
198
198
198
198
198

If I use the code below, the percentile value I get is incorrect:

select percentile('col1', .05) from tblname

output: 106.9

If I add a frequency of 2, the value is also incorrect:

select percentile('col1', .05, 2) from tblname

output: 24.91000000000001

But if I use the code below, I get the expected result (though I don't know why or how):

select percentile('col1', .05, 100) from tblname

output: 15.8

Can anyone help me understand how the last argument changes things? Is there any documentation? I tried checking the docstrings in the Spark source code (I don't know Scala), but no luck. Nothing on the official website either. All I found is this:

percentile(col, percentage [, frequency]) - Returns the exact percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The value of frequency should be positive integral.

Link

ProgrammerPer
Vijay Jangir
  • Using git blame on Spark source code, I found this: https://github.com/apache/spark/pull/16497 – mnicky Apr 21 '21 at 14:49

1 Answer


The frequency argument specifies how many times each element should be counted, so when you specify frequency 100, each element is counted 100 times.

With enough copies of each value, the requested percentile rank lands exactly on one of the existing values, so no interpolation between neighboring values is needed.
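One way to picture this (an assumption about the semantics inferred from the outputs, not taken from the Spark source): frequency f behaves as if every row were duplicated f times before the percentile is computed.

```python
# Hypothetical model: frequency f ~ duplicating each row f times.
col1 = [198, 15.8] + [198] * 9  # the 11 rows from the question

def expand(values, frequency):
    """Repeat each value `frequency` times and sort, mimicking the frequency argument."""
    return sorted(v for v in values for _ in range(frequency))

print(len(expand(col1, 1)))    # 11 rows
print(len(expand(col1, 100)))  # 1100 rows; ranks 0-99 all hold 15.8
```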

Note that you can always find a percentile that still requires interpolation, giving you an unexpected value. For example, in your case, try percentile 0.0901, i.e. the 9.01th percentile.
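All three of your outputs are consistent with exact percentile computed as: sort the (frequency-expanded) values, then linearly interpolate at rank p * (n - 1). Here is a minimal sketch under that assumption (my reading of the observed numbers, not the actual Spark implementation):

```python
# Sketch of exact percentile with linear interpolation at rank p * (n - 1).
# The frequency handling (repeating each value) is an assumption inferred
# from the question's outputs.
def percentile(values, p, frequency=1):
    data = sorted(v for v in values for _ in range(frequency))
    rank = p * (len(data) - 1)
    lo = int(rank)
    frac = rank - lo
    if lo + 1 >= len(data):          # rank falls on the last element
        return data[lo]
    # interpolate between the two neighboring ranks
    return data[lo] + frac * (data[lo + 1] - data[lo])

col1 = [198, 15.8] + [198] * 9

print(percentile(col1, 0.05))         # ~106.9: rank 0.5, between 15.8 and 198
print(percentile(col1, 0.05, 2))      # ~24.91: rank 1.05, still interpolated
print(percentile(col1, 0.05, 100))    # 15.8: rank 54.95 falls among the 15.8 copies
print(percentile(col1, 0.0901, 100))  # interpolated again, despite frequency 100
```

With frequency 100 the rank 54.95 sits between two copies of 15.8, so the interpolation contributes nothing; with percentage 0.0901 the rank (99.0199) straddles the boundary between the 15.8 copies and the 198 copies, so interpolation reappears.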

bluesmoon