-1

I need to calculate the median transaction amount for each customer in the past 52 weeks, but percentile_cont returns NULL if there's no transaction for a particular customer. In such a case, I have to replace NULL with zero, I acheived this by using a CASE clause in sql, however I am using PERCENTILE_CONT twice for this purpose which makes the query slow for a huge list of customers to process. is there a better way to use the PERCENTILE_CONT only once inside the CASE clause?

SELECT DISTINCT customer,

       CASE WHEN 
       PERCENTILE_CONT(0.5) 
       WITHIN GROUP (ORDER BY 
       transamt) OVER 
       (PARTITION BY
      customer) IS NOT NULL THEN

      PERCENTILE_CONT(0.5) WITHIN 
      GROUP (ORDER BY transamt) 
      OVER (PARTITION BY
      customer)

      ELSE 0
      END  AS median_amt

FROM trans_table

WHERE trans_date BETWEEN DATEADD(WEEK, -52, GETDATE() ) AND GETDATE() 
Arash
  • 141
  • 10

3 Answers3

1

I tried COALESCE() function as JHH suggested and didn't see much difference performance wise. it seems COALESCE() is internally a CASE statement.

However when I switched to:

isnull(percentile_cont(0.5) within group (order by transamt) over 
(partition by customer),0)

the computation time dropped by a factor of 2.

I don't know if this can be generalized to any scenarios or is it merely my particular query.

Arash
  • 141
  • 10
  • yes this is generally correct - there used to be a Microsoft connect item about this type of thing open since 2008 (way back machine link https://web.archive.org/web/20160617042004/https://connect.microsoft.com/SQLServer/feedback/details/336002/unnecessarily-bad-performance-for-coalesce-subquery) - but that site was retired years ago. – Martin Smith Dec 05 '22 at 23:52
0

According to "...percentile_cont returns NULL if there's no transaction for a particular customer...", it could be a data issue or by design there could be entries for a customer without any transactions could have trans_date in the past 52 weeks but transamt is null. If that's the case, maybe this work for you by changing order by transamt to the following:

select distinct 
       customer,
       percentile_cont(0.5) within group 
       (order by case when transamt is null then 0 else transamt end) 
       over (partition by customer) as median_amt
  from trans_table;

Second guess: if percentile_cont() return NULL then show 0. Using coalesce().

select distinct customer,
       coalesce(percentile_cont(0.5) within group (order by transamt) over (partition by customer),0) as median_amt
  from trans_table;
JHH
  • 1,331
  • 1
  • 2
  • 9
  • no, in your solution percentile_cont considers zero in place any null. what I'm expecting is to have zero instead of NULL if that customer has no transaction at all. otherwise, percentile_cont should do its expected calculation. – Arash Dec 05 '22 at 02:17
  • @Arash could you please present some sample data to illustrate what it looks like when a customer has no transaction at all in your table? – JHH Dec 05 '22 at 02:37
  • @Arash try coalesce(). Please see updated query. – JHH Dec 05 '22 at 04:32
  • thanks it seems coalesce () is the answer as you suggested. – Arash Dec 05 '22 at 04:56
0

There is a new solution to speed up things called APPROX_PERCENTILE_CONT but it only applies to:

  • SQL Server 2022 (16.x)
  • Azure SQL Database
  • Azure SQL Managed Instance

You can now use:

SELECT DISTINCT customer,

       CASE WHEN 
       APPROX_PERCENTILE_CONT(0.5) 
       WITHIN GROUP (ORDER BY 
       transamt) OVER 
       (PARTITION BY
      customer) IS NOT NULL THEN

      APPROX_PERCENTILE_CONT(0.5) WITHIN 
      GROUP (ORDER BY transamt) 
      OVER (PARTITION BY
      customer)

      ELSE 0
      END  AS median_amt

FROM trans_table

WHERE trans_date BETWEEN DATEADD(WEEK, -52, GETDATE() ) AND GETDATE() 

In this example it pushed the speed from ~13 seconds to ~1.5 seconds.

Francesco Mantovani
  • 10,216
  • 13
  • 73
  • 113