Currently, to percentile rank a column in hive, I am using something like the following. I am trying to rank items in a column by what percentile they fall under, assigning a value form 0 to 1 to each item. The code below assigns a value from 0 to 9, essentially saying that an item with a char_percentile_rank
of 0 is in the bottom 10% of items, and a value of 9 is in the top 10% of items. Is there a better way of doing this?
select item
, characteristic
, case when characteristic <= char_perc[0] then 0
when characteristic <= char_perc[1] then 1
when characteristic <= char_perc[2] then 2
when characteristic <= char_perc[3] then 3
when characteristic <= char_perc[4] then 4
when characteristic <= char_perc[5] then 5
when characteristic <= char_perc[6] then 6
when characteristic <= char_perc[7] then 7
when characteristic <= char_perc[8] then 8
else 9
end as char_percentile_rank
from (
select split(item_id,'-')[0] as item
, split(item_id,'-')[1] as characteristic
, char_perc
from (
select collect_set(concat_ws('-',item,characteristic)) as item_set
, PERCENTILE(BIGINT(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) as char_perc
from(
select item
, sum(characteristic) as characteristic
from table
group by item
) t1
) t2
lateral view explode(item_set) explodetable as item_id
) t3
Note: I had to do the collect_set
in order to avoid a self join, as the percentile function implicitly performs a group by
.
I've gathered that the percentile function is horribly slow (at least in this usage). Perhaps it would be better to manually calculate percentile?