3

I want to calculate the cumulative percentiles for a given partition/group in SQL. For example the input data looks like -

CustID     Product ID     quantity_purchased    
1          111                2
2          111                3
3          111                2 
4          111                5
1          222                2
2          222                6
4          222                7
6          222                2

I want to get cumulative percentiles on each product ID group. The output should be -

Product ID    min      25%      50%      75%     max
    111        2        2       2.5      3.5      5
    222        2        2       2.5      5.25     7     

So, basically for Product ID 111 I need to take the percentiles of quantity_purchased for only product ID 111 but as I proceed further in the column the percentiles should be cumulative meaning for product ID 222 the percentiles will be calculated considering quantity_purchased values of both Product ID 111 and product ID 222 (2,3,2,5,2,6,7,2). Similarly, if there was product ID 333 in data then for product ID 333, I would compute the percentiles based on all of the quantity_purchased values associated with product 111, product 222, and product 333 and store the result in the product 333 row. How to achieve this using SQL?

User
  • 57
  • 6
  • What does 2.5, 5.25 mean? – AIMIN PAN Jul 14 '21 at 03:41
  • @AIMINPAN so to calculate percentiles for Product ID 222, I will consider all quantity_purchased values of Product ID 111 and product ID 222 (2,3,2,5,2,6,7,2) and do a percentile on all of those. I added a bit more details to the question – User Jul 14 '21 at 06:01

2 Answers2

2

This is very curious, but I think you need to expand the data for each product id:

select t.product_id, min(t2.quantity_purchased), max(t2.quantity_purchased),
       percentile_cont(0.25) within group (order by t2.quantity_purchased),
       percentile_cont(0.50) within group (order by t2.quantity_purchased),
       percentile_cont(0.75) within group (order by t2.quantity_purchased)
from t join
     t t2
     on t2.product_id <= t.product_id
group by t1.product_id;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Nice one Gordon! I started going down the cumulative hole which halted progress. Great answer and awesome tricky question. – Adrian White Jul 17 '21 at 02:43
0

This used PERCENTILE_CONT instead of PERCENTILE_DISC the key difference being returned val is based on continuous distribution using linear interpolation where values don't line up perfect -depending on your use case this may provide more accurate data points. :-)

select
    ProductID,
    min(Quantity_Purchased::float) min,
    PERCENTILE_CONT(.25) WITHIN GROUP (ORDER BY Quantity_Purchased ) as "25%",
    PERCENTILE_CONT(.50) WITHIN GROUP (ORDER BY Quantity_Purchased ) as "50%",
    PERCENTILE_CONT(.75) WITHIN GROUP (ORDER BY Quantity_Purchased ) as "75%" ,
    max(Quantity_Purchased) max
from
    cte
group by
    1

enter image description here

Copy|Paste|Run in Snowflake

with cte as (
select
    1 CustID,
    111 ProductID,
    2 Quantity_Purchased
union
select
    2 CustID,
    111 ProductID,
    3 Quantity_Purchased
union
select
    3 CustID,
    111 ProductID,
    2 Quantity_Purchased
union
select
    4 CustID,
    111 ProductID,
    5 Quantity_Purchased
union
select
    1 CustID,
    222 ProductID,
    2 Quantity_Purchased
union
select
    2 CustID,
    222 ProductID,
    6 Quantity_Purchased
union
select
    4 CustID,
    222 ProductID,
    7 Quantity_Purchased
union
select
    6 CustID,
    222 ProductID,
    2 Quantity_Purchased
)  
select
    ProductID,
    min(Quantity_Purchased::float) min,
    PERCENTILE_CONT(.25) WITHIN GROUP (ORDER BY Quantity_Purchased ) as "25%",
    PERCENTILE_CONT(.50) WITHIN GROUP (ORDER BY Quantity_Purchased ) as "50%",
    PERCENTILE_CONT(.75) WITHIN GROUP (ORDER BY Quantity_Purchased ) as "75%" ,
    max(Quantity_Purchased) max
from
    cte
group by
    1
Adrian White
  • 1,720
  • 12
  • 14
  • This isn't the expected output. For Product ID 222 the expected out is 2.5 for 50%tile and 5.25 for 75%tile. (check the expected output in question) So, basically for Product ID 111 I need to take the percentiles of `quantity_purchased` for only product ID 111 but when we go to product ID 222 the percentiles will be cumulative meaning the percentiles will be calculated considering `quantity_purchased` values of both Product ID 111 and product ID 222 – User Jul 14 '21 at 05:51