I have a table like this in Hive:
user_id   no. of game_plays
u1 52
u2 190
u10 166
u9 100
u3 90
u4 44
u5 21
u7 10
u8 5
The above is just a very small sample of the data.
The total game_plays in this sample is 678.
I want to split the users into three groups:
those who contribute to the top 33.3% of total game_plays,
those who contribute to between 33.3% and 66.6% of total game_plays, and
those who contribute to the bottom 33.3% of total game_plays.
Basically, split the data into 3 groups as above and get the top 20 users from each group.
I know how to implement this logic in BigQuery: get the percentile value ordered by game_plays, put a CASE statement in the query to assign each user to a group, rank users by game_plays within each group, and select rank <= 20.
That gives the result I want.
I don't know how to implement this kind of thing in Hive.
I have gone through the pages below, but I still don't see how to do it:
How to implement percentile in Hive?
How to calculate median in Hive
I have also gone through the functions documentation linked below:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
I know I have to use the percentile function, but I am not sure exactly how to implement it.
Below is the code I have tried:
select a.user_id, a.game_plays, percentile(a.game_plays, 0.66) as percentile
from (
    select user_id, sum(game_plays) as game_plays
    from game_play_table
    where data_date = '2019-06-01'
    group by user_id
) a
I know the above code doesn't give the exact output by itself; I expected to get the output I want by writing an outer query over it. However, the output of this query is itself very different from what I expected.
Can anyone please help?
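
For what it's worth, the BigQuery-style logic described above (percentile cut-offs, a CASE statement, a rank within each group, then rank <= 20) might be sketched in HiveQL roughly as follows. This is only a sketch under assumptions, not a verified solution. The table and column names come from the query above. The CTE names (plays, cutoffs, grouped, ranked), the play_group and rnk columns, and the 'top' / 'middle' / 'bottom' labels are hypothetical. It assumes game_plays is an integer column, since percentile() only accepts integral types (percentile_approx() would be the substitute for floating-point values), and it assumes the groups are defined by the 33rd and 66th percentile of per-user game_plays rather than by cumulative share of the total.

```sql
-- Sketch only: CTE names, labels and cut-offs are illustrative assumptions.
WITH plays AS (
    -- per-user totals for the day, as in the original attempt
    SELECT user_id, SUM(game_plays) AS game_plays
    FROM game_play_table
    WHERE data_date = '2019-06-01'
    GROUP BY user_id
),
cutoffs AS (
    -- percentile() is an aggregate, so with no GROUP BY it returns
    -- a single row for the whole set of users
    SELECT percentile(game_plays, 0.33) AS p33,
           percentile(game_plays, 0.66) AS p66
    FROM plays
),
grouped AS (
    -- attach the single cutoffs row to every user and label the group
    SELECT p.user_id,
           p.game_plays,
           CASE
               WHEN p.game_plays >= c.p66 THEN 'top'
               WHEN p.game_plays >= c.p33 THEN 'middle'
               ELSE 'bottom'
           END AS play_group
    FROM plays p
    CROSS JOIN cutoffs c
),
ranked AS (
    -- number users inside each group, highest game_plays first
    SELECT user_id,
           game_plays,
           play_group,
           row_number() OVER (PARTITION BY play_group
                              ORDER BY game_plays DESC) AS rnk
    FROM grouped
)
SELECT user_id, game_plays, play_group, rnk
FROM ranked
WHERE rnk <= 20;
```

The percentile is computed in its own subquery because mixing an aggregate with non-aggregated columns (user_id, game_plays) in one SELECT without a GROUP BY is not valid. Note that some Hive configurations running in strict mode may reject the CROSS JOIN as a Cartesian product, even though cutoffs has only one row.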