
I'm trying to use the percentile function in Spark SQL.

Data:

col1
----
198
15.8
198
198
198
198
198
198
198
198
198

If I use the code below, the percentile value I get is incorrect:

select percentile('col1', .05) from tblname

output: 106.9

If I add a frequency of 2, the value is also incorrect:

select percentile('col1', .05, 2) from tblname

output: 24.91000000000001

But if I use the code below, I get the expected result (though I don't know why or how):

select percentile('col1', .05, 100) from tblname

output: 15.8

Can anyone help me understand how the last argument changes things? Is there any documentation? I tried checking the docstrings in the Spark source code (I don't know Scala), but no luck. Nothing on the official website either. All I found is this:

percentile(col, percentage [, frequency]) - Returns the exact percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The value of frequency should be positive integral.

Link

ProgrammerPer
Vijay Jangir
  • Using git blame on Spark source code, I found this: https://github.com/apache/spark/pull/16497 – mnicky Apr 21 '21 at 14:49

1 Answer


The frequency argument specifies how many times each element should be counted, so when you specify frequency 100, each element is counted 100 times.

With enough copies of each value, the requested percentile rank lands exactly on one of the existing values, so no interpolation between neighboring values is needed.
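One way to picture this (an assumption about the semantics inferred from the outputs, not taken from the Spark source): frequency f behaves as if every row were duplicated f times before the percentile is computed.

```python
# Hypothetical model: frequency f ~ duplicating each row f times.
col1 = [198, 15.8] + [198] * 9  # the 11 rows from the question

def expand(values, frequency):
    """Repeat each value `frequency` times and sort, mimicking the frequency argument."""
    return sorted(v for v in values for _ in range(frequency))

print(len(expand(col1, 1)))    # 11 rows
print(len(expand(col1, 100)))  # 1100 rows; ranks 0-99 all hold 15.8
```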

Note that you can always find a percentile that still requires interpolation, giving you an unexpected value. For example, in your case, try percentile 0.0901, i.e. the 9.01th percentile.
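All three of your outputs are consistent with exact percentile computed as: sort the (frequency-expanded) values, then linearly interpolate at rank p * (n - 1). Here is a minimal sketch under that assumption (my reading of the observed numbers, not the actual Spark implementation):

```python
# Sketch of exact percentile with linear interpolation at rank p * (n - 1).
# The frequency handling (repeating each value) is an assumption inferred
# from the question's outputs.
def percentile(values, p, frequency=1):
    data = sorted(v for v in values for _ in range(frequency))
    rank = p * (len(data) - 1)
    lo = int(rank)
    frac = rank - lo
    if lo + 1 >= len(data):          # rank falls on the last element
        return data[lo]
    # interpolate between the two neighboring ranks
    return data[lo] + frac * (data[lo + 1] - data[lo])

col1 = [198, 15.8] + [198] * 9

print(percentile(col1, 0.05))         # ~106.9: rank 0.5, between 15.8 and 198
print(percentile(col1, 0.05, 2))      # ~24.91: rank 1.05, still interpolated
print(percentile(col1, 0.05, 100))    # 15.8: rank 54.95 falls among the 15.8 copies
print(percentile(col1, 0.0901, 100))  # interpolated again, despite frequency 100
```

With frequency 100 the rank 54.95 sits between two copies of 15.8, so the interpolation contributes nothing; with percentage 0.0901 the rank (99.0199) straddles the boundary between the 15.8 copies and the 198 copies, so interpolation reappears.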

bluesmoon