Presto equivalent for Redshift's PERCENTILE_DISC

Question

Given a query below in Redshift:

select 
distinct cast(joinstart_ev_timestamp as date) as session_date, 
PERCENTILE_DISC(0.02) WITHIN GROUP (ORDER BY join_time) over(partition by 
trunc(joinstart_ev_timestamp))/1000 as mini,
median(join_time) over(partition by trunc(joinstart_ev_timestamp))/1000 as jt,
product_name as product,
endpoint as endpoint
from qe_datawarehouse.join_session_fact
where  
cast(joinstart_ev_timestamp as date)  between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%' 
and join_time > 0 and join_time <= 600000 and join_time is not null 
and audio_connect_time >= 0 
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0  or panel_connect_time is null) and version = 'V2'

I need to convert above Query to corresponding Presto syntax. Corresponding Presto query I wrote is:

select 
distinct cast(joinstart_ev_timestamp as date) as session_date, 
PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double)) 
over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,
approx_percentile(cast(join_time as double),0.50) over (partition by 
cast(joinstart_ev_timestamp as date)) /1000 as jt,
product_name as product,
endpoint as endpoint
from datawarehouse.join_session_fact
where  
cast(joinstart_ev_timestamp as date)  between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%' 
and join_time > 0 and join_time <= 600000 and join_time is not null 
and audio_connect_time >= 0 
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0  or panel_connect_time is null) and version = 'V2'

Here, everything is working fine but it is showing error in the line:

PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double)) 
    over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,

What will be its corresponding Presto Syntax?

score 0 · Answer 1 · answered Feb 14 '18 at 21:34

If Presto supported nested window functions then you could use NTH_VALUE along with p*COUNT(*) OVER (PARTITION BY ...) to find the offset corresponding to the "p'th" percentile in the window. Since Presto doesn't support this, you need to join to a subquery that calculates the number of records in the window instead:

SELECT
  my_table.window_column,
  /* Replace :p with the desired percentile (in your case, 0.02) */
  NTH_VALUE(:p*subquery.records_in_window, my_table.ordered_column)
    OVER (PARTITION BY my_table.window_column ORDER BY my_table.ordered_column BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM my_table
JOIN (
  SELECT
    window_column,
    COUNT(*) AS records_in_window
  FROM my_table
  GROUP BY window_column
) subquery ON subquery.window_column = my_table.window_column

The above is conceptually close but fails because :p*subquery.records_in_window is a float and the offset needs to be an integer. You have a few options for how to deal with this this. For example, if you're finding the median then simply rounding to the nearest integer works. If you're finding the 2nd percentile, rounding won't work because it will often give you 0 and the offset starts at 1. In that case, rounding the ceiling to the nearest integer might be better.

frank · Answer 2 · 2020-04-23T19:21:58.520

0

I was doing some research on median in presto, and found a solution that worked for me:

For example, I had a join table, A_join_B, that has columns A_id and B_id.

I wanted to find median of number of A related to a single B

SELECT APPPROX_PERCENTILE(count, 0.5) FROM ( SELECT COUNT(*) AS count, narrative_id FROM A_join_B GROUP BY B_id ) as counts;

edited Apr 23 '20 at 19:21

answered Apr 23 '20 at 19:08

frank

1,283
1
19
39

Presto equivalent for Redshift's PERCENTILE_DISC

2 Answers2