8

I'm trying to calculate the cumulative sum of columns t and s over a date from my sample data below, using Presto SQL.

Date   | T | S 
1/2/19 | 2 | 5
2/1/19 | 5 | 1
3/1/19 | 1 | 1

I would like to get

Date   | T | S | cum_T | cum_S 
1/2/19 | 2 | 5 |    2  |  5 
2/1/19 | 5 | 1 |    7  |  6
3/1/19 | 1 | 1 |    8  |  7

However when I run the below query using Presto SQL I am receiving an unexpected error message, telling me to put columns T and S into the group by section of my query.

Is this expected? When I remove the group by from my query it runs without error, but produces duplicate date rows. +

select
  date_trunc('day',tb1.date),
  sum(tb1.S) over (partition by date_trunc('day',tb1.date) order by date_trunc('day',tb1.date) rows unbounded preceding )  as cum_S,
  sum(tb1.T) over (partition by date_trunc('day',tb1.date) order by date_trunc('day',tb1.date) rows unbounded preceding)  as cum_T
from esi_dpd_bi_esds_prst.points_tb1_use_dedup_18months_vw tb1
where 
  tb1.reason_id not in (45,264,418,983,990,997,999,1574)
  and tb1.group_id not in (22)
  and tb1.point_status not in (3)
  and tb1.date between cast(DATE '2019-01-01' as date) and cast( DATE '2019-01-03' as date)
group by 
    1
order by date_trunc('day',tb1.date) desc 

Error looks like this:

Error: line 3:1: '"sum"(tb1.S) OVER (PARTITION BY "date_trunc"('day', tb1.tb1) ORDER BY "date_trunc"('day', tb1.tb1) ASC ROWS UNBOUNDED PRECEDING)' must be an aggregate expression or appear in GROUP BY clause.
user124123
  • 1,642
  • 7
  • 30
  • 50

1 Answers1

13

You have an aggregation query and you want to mix the aggregations with window functions. The correct syntax is:

select date_trunc('day', tb1.date),
       sum(tbl1.S) as S,
       sum(tbl1.T) as T,
       sum(sum(tb1.S)) over (order by date_trunc('day', tb1.date) rows unbounded preceding )  as cum_S,
       sum(sum(tb1.T)) over (order by date_trunc('day', tb1.date) rows unbounded preceding)  as cum_T
from esi_dpd_bi_esds_prst.points_tb1_use_dedup_18months_vw tb1
where tb1.reason_id not in (45, 264, 418, 983, 990, 997, 999, 1574) and
      tb1.group_id not in (22) and
      tb1.point_status not in (3) and
      tb1.date between cast(DATE '2019-01-01' as date) and cast( DATE '2019-01-03' as date)
group by 1
order by date_trunc('day', tb1.date) desc ;

That is, the window function is running after the aggregation and needs to process the aggregated value.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks for your answer. This ran without error but produced sums for cum_S and cum_T, not cumulative sums, col 2 equaled col 4 and col 3 equaled col 5. – user124123 May 14 '19 at 00:33
  • 1
    @user124123 . . . I left your `partition by` clause in the calculation. That is clearly not necessary -- you want to order by the date, not partition by it. – Gordon Linoff May 14 '19 at 00:34
  • Yes this works now, I'm not quite sure if I understand your explanation after the code, do you know any resources I could look at to explain this in more detail? – user124123 May 14 '19 at 00:42
  • @user124123 . . . If you understand window functions, then just practice with using window functions with aggregation. – Gordon Linoff May 14 '19 at 00:47
  • @alwaysaskingquestions . . . No. One `sum()` is for aggregation. The second is for a window function. – Gordon Linoff Jun 04 '21 at 12:49