8

Currently, to percentile rank a column in hive, I am using something like the following. I am trying to rank items in a column by what percentile they fall under, assigning a value form 0 to 1 to each item. The code below assigns a value from 0 to 9, essentially saying that an item with a char_percentile_rank of 0 is in the bottom 10% of items, and a value of 9 is in the top 10% of items. Is there a better way of doing this?

select item
    , characteristic
    , case when characteristic <= char_perc[0] then 0
        when characteristic <= char_perc[1] then 1
        when characteristic <= char_perc[2] then 2
        when characteristic <= char_perc[3] then 3
        when characteristic <= char_perc[4] then 4
        when characteristic <= char_perc[5] then 5
        when characteristic <= char_perc[6] then 6
        when characteristic <= char_perc[7] then 7
        when characteristic <= char_perc[8] then 8
        else 9
      end as char_percentile_rank
from (
    select split(item_id,'-')[0] as item
        , split(item_id,'-')[1] as characteristic
        , char_perc
    from (
        select collect_set(concat_ws('-',item,characteristic)) as item_set
            , PERCENTILE(BIGINT(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) as char_perc
        from(
            select item
                , sum(characteristic) as characteristic
            from table
            group by item
        ) t1
    ) t2
    lateral view explode(item_set) explodetable as item_id
) t3

Note: I had to do the collect_set in order to avoid a self join, as the percentile function implicitly performs a group by.

I've gathered that the percentile function is horribly slow (at least in this usage). Perhaps it would be better to manually calculate percentile?

Charlie Haley
  • 4,152
  • 4
  • 22
  • 36

1 Answers1

6

Try removing one of your derived tables

select item
    , characteristic
    , case when characteristic <= char_perc[0] then 0
        when characteristic <= char_perc[1] then 1
        when characteristic <= char_perc[2] then 2
        when characteristic <= char_perc[3] then 3
        when characteristic <= char_perc[4] then 4
        when characteristic <= char_perc[5] then 5
        when characteristic <= char_perc[6] then 6
        when characteristic <= char_perc[7] then 7
        when characteristic <= char_perc[8] then 8
        else 9
      end as char_percentile_rank
from (
     select item, characteristic,
         , PERCENTILE(BIGINT(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) over () as char_perc 
     from (
       select item
         , sum(characteristic) as characteristic             
       from table
       group by item            
     ) t1
) t2
FuzzyTree
  • 32,014
  • 3
  • 54
  • 85
  • The percentile function needs to have a group by - that's why I did the seemingly unnecessary collect_set(concat_ws(...)) - in order to make it one group. If I instead group by item and characteristic, then it percentiles by item and characteristic, which I do not want. Any ideas on how to avoid that? – Charlie Haley Aug 11 '15 at 13:41
  • 1
    I figured it out almost immediately after sleeping on it and coming back to it. Just add over() on the percentile. So the line should read: PERCENTILE(BIGINT(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) over() as char_perc. But it seems like both my original query and the new one have very similar performance. – Charlie Haley Aug 12 '15 at 14:03
  • @CharlieHaley is it possible to reference `PERCENTILE(BIGINT...)[i]` where `i = 0-8`? If so, you can move the case statement and remove 1 derived table – FuzzyTree Aug 12 '15 at 14:20
  • 1
    @CharlieHaley Another suggestion, is it possible to move `PERCENTILE(BIGINT(sum(characteristic),...) over () as characteristic` into the innermost query? Because I think window functions run after aggregation – FuzzyTree Aug 12 '15 at 14:26
  • I'll try it both ways and see if either produces a cleaner tree. – Charlie Haley Aug 12 '15 at 14:31
  • 1
    It looks like moving the percentile in either direction (inward or outward) does not affect the syntax tree, so it shouldn't affect performance either. But it does make the query a bit better looking, so there's that. – Charlie Haley Aug 12 '15 at 14:49