How to simulate BigQuery's quantiles in Hive

Question

I want to simulate BigQuery's QUANTILES function in Hive.

Data set: 1,2,3,4

BigQuery's query result will return value 2

select nth(2, quantiles(col1, 3))

But in Hive:

select percentile(col1, 0.5)

I've got 2.5

Note: I've got same result for odd number of records.

Is there any adequate Hive's udf functions?

`QUANTILES` is a statistical approximation. The analogue when using standard SQL with BigQuery is called `APPROX_QUANTILES`. Why would you want an approximate median if Hive is giving you an exact value? — Elliott Brossard, Sep 07 '17 at 14:23
Thank you Elliott for your quick response. I've got better result for my use case with approximate median. As I understand there isn't approx_quantiles function in Hive. — goRunToStack, Sep 07 '17 at 14:33

score 0 · Answer 1 · answered Sep 07 '17 at 15:53

I guess what you are looking for is the percentile_approx UDF.

This page gives you the list of all built-in UDFs in Hive.

percentile_approx(DOUBLE col, p [, B])

Returns an approximate pth percentile of a numeric column (including floating point types) in the group. The B parameter controls approximation accuracy at the cost of memory. Higher values yield better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, this gives an exact percentile value.

The result from percentile_approx vary and it depends on number of rows and B parameter. But BigQuery query result is consistent. — goRunToStack, Sep 08 '17 at 08:50

How to simulate BigQuery's quantiles in Hive

1 Answers1