2

Assume I have a table like this

id cnt tier
1 100 gold
2 200 silver
3 300 bronze
4 400 bronze
5 500 bronze
6 600 gold
7 700 silver
8 800 silver
9 900 silver
10 1000 silver

and if I want percentile rank over cnt within each tier (e.g. For tier 'gold', I have 2 rows, cnt = 100 and cnt = 600, so they're ranked at 0.0 percentile and 1.0 percentile respectively.) I can easily do PERCENT_RANK() OVER(PARTITION BY tier ORDER BY cnt) AS percentile1, thus

id cnt tier percentile1
1 100 gold 0.0
2 200 silver 0.0
3 300 bronze 0.0
4 400 bronze 0.5
5 500 bronze 1.0
6 600 gold 1.0
7 700 silver 0.25
8 800 silver 0.5
9 900 silver 0.75
10 1000 silver 1.0

Now what if I want a percentile rank over cnt but among all tiers' averages? To better illustrate, we have gold's avg = 350, silver's avg = 720, and bronze's avg = 400, so to get cnt's percentile rank among [350, 400, 720] for each row. i.e.

id cnt tier percentile2
1 100 gold 0.0
2 200 silver 0.0
3 300 bronze 0.0
4 400 bronze 0.333
5 500 bronze 0.666
6 600 gold 0.666
7 700 silver 0.666
8 800 silver 1.0
9 900 silver 1.0
10 1000 silver 1.0

How should I write the (Hive) SQL? Thanks

xiaolong
  • 3,396
  • 4
  • 31
  • 46

4 Answers4

0

I'd compute the averages and join it to the original table. To rank with respect to the averages, I added a placeholder row to hold the original row.

select * from (
  select *, PERCENT_RANK() OVER(PARTITION BY id order by avg_cnt) AS percentile1 from (
    select 
        t2.*, 
        t1.avg_tier, 
        case t1.avg_tier == "original_row" when true 
            then t2.cnt else t1.avg_cnt end as avg_cnt
    from YOUR_TABLE t2
    join (
      -- find averages and add a placeholder row 
      select tier as avg_tier, avg(cnt) as avg_cnt from YOUR_TABLE group by 1 
      union
      select "original_row", -1 as avg_cnt  -- placeholder row
    ) t1
  ) t3
) t4
where avg_tier="original_row"; -- filter the original row

Before using PERCENT_RANK(), this is what the joined table looks like.

t2.id t2.cnt t2.tier t1.avg_tier avg_cnt
1 100 gold original_row 100
1 100 gold bronze 400
1 100 gold gold 350
1 100 gold silver 720

...

Clover
  • 11
  • 6
  • no join condition? – xiaolong Apr 17 '22 at 02:42
  • No join condition here. The design is that for each value in the original table, I'll compare it to the tiers' averages. In the example, the `joined table` shows there are 3 rows for tiers' averages, plus additional row for the value in the original table. My aim is to build a table so I can utilize `percent_rank() over partition` . – Clover Apr 18 '22 at 05:55
0

Try something like this. First you use CTE to compute averages, then use it to obtain the desired percentile ranks.

Update: I also added a correction to deal with ties.

with cte as
(
select tier, avg(cnt) avg_cnt
from mytable
group by tier
)
select id,cnt,tier,
((select cast(count(*) as double) from cte where t.cnt > avg_cnt) 
+ (select cast(count(*)/2 as double) from cte where t.cnt = avg_cnt) 
)
/ (select cast(count(*) as double) from cte)
from mytable t
Igor N.
  • 394
  • 2
  • 3
0

I don't have hive in my machine. do we have crossapply in Hive or equivalent.

Can try something like this and apply rant on avg percentage.

select t1.*, 1.00*t1.cnt/a.avg_cnt
from test t1
cross apply (select tier, avg(cnt) avg_cnt
             from test t2 
             where t1.tier=t2.tier
             group by tier) A
Roshan Nuvvula
  • 803
  • 1
  • 8
  • 28
0

Here's a solution with one CTE, and one Window. I've named the main table tests.

with ranks as (
  select 
    avg(cnt) as t1,
    lead(avg(cnt)) over w as t2,
    percent_rank() over w as percentile2
  from (select * from tests union select null, -1, null) as n
  group by tier
  window w as (order by avg(cnt))
)
select tests.*, percentile2
from tests 
join ranks 
  on (tests.cnt > t1 and (tests.cnt <= t2 or t2 is null))

Results:

| id  | cnt  | tier   | percentile2        |
| --- | ---- | ------ | ------------------ |
| 1   | 100  | gold   | 0                  |
| 2   | 200  | silver | 0                  |
| 3   | 300  | bronze | 0                  |
| 4   | 400  | bronze | 0.3333333333333333 |
| 5   | 500  | bronze | 0.6666666666666666 |
| 6   | 600  | gold   | 0.6666666666666666 |
| 7   | 700  | silver | 0.6666666666666666 |
| 8   | 800  | silver | 1                  |
| 9   | 900  | silver | 1                  |
| 10  | 1000 | silver | 1                  |

This SQL is using Postgres syntax, but it should still work in Hive.

The idea is that you first create your ranks by averaging the tier counts and adding an additional dummy row so that it creates 4 ranks instead of 3. That's the union. The lead makes it so that each row in the CTE contains the lower and upper bounds of the rank (t1 and t2).

Then it's just a matter of joining on that CTE where the cnt falls between t1 and t2.

TonyArra
  • 10,607
  • 1
  • 30
  • 46