Percentile rank over multiple partitions averages

Question

Assume I have a table like this

id	cnt	tier
1	100	gold
2	200	silver
3	300	bronze
4	400	bronze
5	500	bronze
6	600	gold
7	700	silver
8	800	silver
9	900	silver
10	1000	silver

and if I want percentile rank over cnt within each tier (e.g. For tier 'gold', I have 2 rows, cnt = 100 and cnt = 600, so they're ranked at 0.0 percentile and 1.0 percentile respectively.) I can easily do PERCENT_RANK() OVER(PARTITION BY tier ORDER BY cnt) AS percentile1, thus

id	cnt	tier	percentile1
1	100	gold	0.0
2	200	silver	0.0
3	300	bronze	0.0
4	400	bronze	0.5
5	500	bronze	1.0
6	600	gold	1.0
7	700	silver	0.25
8	800	silver	0.5
9	900	silver	0.75
10	1000	silver	1.0

Now what if I want a percentile rank over cnt but among all tiers' averages? To better illustrate, we have gold's avg = 350, silver's avg = 720, and bronze's avg = 400, so to get cnt's percentile rank among [350, 400, 720] for each row. i.e.

id	cnt	tier	percentile2
1	100	gold	0.0
2	200	silver	0.0
3	300	bronze	0.0
4	400	bronze	0.333
5	500	bronze	0.666
6	600	gold	0.666
7	700	silver	0.666
8	800	silver	1.0
9	900	silver	1.0
10	1000	silver	1.0

How should I write the (Hive) SQL? Thanks

Clover · Answer 1 · 2022-04-09T12:42:55.517

I'd compute the averages and join it to the original table. To rank with respect to the averages, I added a placeholder row to hold the original row.

select * from (
  select *, PERCENT_RANK() OVER(PARTITION BY id order by avg_cnt) AS percentile1 from (
    select 
        t2.*, 
        t1.avg_tier, 
        case t1.avg_tier == "original_row" when true 
            then t2.cnt else t1.avg_cnt end as avg_cnt
    from YOUR_TABLE t2
    join (
      -- find averages and add a placeholder row 
      select tier as avg_tier, avg(cnt) as avg_cnt from YOUR_TABLE group by 1 
      union
      select "original_row", -1 as avg_cnt  -- placeholder row
    ) t1
  ) t3
) t4
where avg_tier="original_row"; -- filter the original row

Before using PERCENT_RANK(), this is what the joined table looks like.

t2.id	t2.cnt	t2.tier	t1.avg_tier	avg_cnt
1	100	gold	original_row	100
1	100	gold	bronze	400
1	100	gold	gold	350
1	100	gold	silver	720

...

No join condition here. The design is that for each value in the original table, I'll compare it to the tiers' averages. In the example, the `joined table` shows there are 3 rows for tiers' averages, plus additional row for the value in the original table. My aim is to build a table so I can utilize `percent_rank() over partition` . — Clover, Apr 18 '22 at 05:55

Igor N. · Answer 2 · 2022-04-16T03:19:17.287

Try something like this. First you use CTE to compute averages, then use it to obtain the desired percentile ranks.

Update: I also added a correction to deal with ties.

with cte as
(
select tier, avg(cnt) avg_cnt
from mytable
group by tier
)
select id,cnt,tier,
((select cast(count(*) as double) from cte where t.cnt > avg_cnt) 
+ (select cast(count(*)/2 as double) from cte where t.cnt = avg_cnt) 
)
/ (select cast(count(*) as double) from cte)
from mytable t

score 0 · Answer 3 · answered Apr 21 '22 at 09:19

I don't have hive in my machine. do we have crossapply in Hive or equivalent.

Can try something like this and apply rant on avg percentage.

select t1.*, 1.00*t1.cnt/a.avg_cnt
from test t1
cross apply (select tier, avg(cnt) avg_cnt
             from test t2 
             where t1.tier=t2.tier
             group by tier) A

TonyArra · Answer 4 · 2022-04-24T18:34:05.697

Here's a solution with one CTE, and one Window. I've named the main table tests.

with ranks as (
  select 
    avg(cnt) as t1,
    lead(avg(cnt)) over w as t2,
    percent_rank() over w as percentile2
  from (select * from tests union select null, -1, null) as n
  group by tier
  window w as (order by avg(cnt))
)
select tests.*, percentile2
from tests 
join ranks 
  on (tests.cnt > t1 and (tests.cnt <= t2 or t2 is null))

Results:

| id  | cnt  | tier   | percentile2        |
| --- | ---- | ------ | ------------------ |
| 1   | 100  | gold   | 0                  |
| 2   | 200  | silver | 0                  |
| 3   | 300  | bronze | 0                  |
| 4   | 400  | bronze | 0.3333333333333333 |
| 5   | 500  | bronze | 0.6666666666666666 |
| 6   | 600  | gold   | 0.6666666666666666 |
| 7   | 700  | silver | 0.6666666666666666 |
| 8   | 800  | silver | 1                  |
| 9   | 900  | silver | 1                  |
| 10  | 1000 | silver | 1                  |

This SQL is using Postgres syntax, but it should still work in Hive.

The idea is that you first create your ranks by averaging the tier counts and adding an additional dummy row so that it creates 4 ranks instead of 3. That's the union. The lead makes it so that each row in the CTE contains the lower and upper bounds of the rank (t1 and t2).

Then it's just a matter of joining on that CTE where the cnt falls between t1 and t2.

Percentile rank over multiple partitions averages

4 Answers4