0

I have a table like this in hive

user_id     no.of game_plays
u1           52
u2           190
u10          166
u9           100
u3           90
u4           44
u5           21
u7           10
u8           5

Above is just a very small sample of data.

So, the total game_plays are 678

I want to calculate the users in each group like below

who contribute to top 33.3% of total game_plays and 
who contribute to between 33.3% and 66.6% of total game_plays 
who contribute to bottom 33.3% of total game_plays

Basically, split the data into 3 group like above and get top 20 users from each group.

I know the logic of how to implement in BigQuery like....get the percentile value ordered by game_plays and then put a case statement in the above query and rank using game_plays in each group and select rank <=20

which give the result I wanted.

I don't know how to implement this kind of thing in hive.

I have gone through the below pages, but not getting an idea

How to implement percentile in Hive?

How to calculate median in Hive

And have gone through the functions link below,

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

I know I have to percentile function...but knot sure exactly how I implement.

Below is the code I have tried,

select a.user_id,a.game_plays, percentile(a.game_plays,0.66) as percentile
from (
select user_id, sum(game_plays) as game_plays
from game_play_table
where data_date = '2019-06-01' 
group by user_id) a

I know the above code doesn't give exact give output, but after writing an outer query over it....I can get the output I wanted....but the above query output is itself very different.

Can anyone please help ???

Munagala
  • 69
  • 2
  • 12

1 Answers1

0

You could use "case" to calculate the percentile

select user_id,game_plays ,
case when (game_plays * (100 /678)) > 33.3 then 'top 33.3%'
when (game_plays * (100 /678)) > 33.3) and (game_plays * (100 /678)) < 66.6) then 'between 33.3% and 66.6%'
when (game_plays * (100 /678)) < 33.3) then 'less then 33.3%'
end as percentile 
from game_play_table
where data_date = '2019-06-01' 
group by user_id
M.achaibou
  • 91
  • 3
  • 15
  • Hi M.achaibou, I think the above puts every thing in 'less then 33.3%' unless until any single record having a value that contributes more than 33.3% then it changes the bucket.....just to be more clear....I want users who cumulatively make the top 33.3% of game_plays....did you understand the difference ? – Munagala Jul 08 '19 at 08:26