SQL to calculate cumulative sum that resets based on previous value in a column in Hive

Question

I am trying to create a cumulative value with something like this

KEY1    Date_    VAL1    CUMU_VAL2
K1      D1      1       0
K1      D2      1       1
K1      D3      0       2
K1      D4      1       0
K1      D5      1       1

So, the issue is basically to keep on adding the value by 1 in column CUMU_VAL2 based on the previous row in VAL1, but this sum resets when the previous value in VAL1 column is zero. Basically if you do it in excel the formula for say Cell(D3) is

  D3 = IF(C2>0, D2+1, 0)

I believe I should be able to something like this, but how do I add in the Case when previous value is zero then reset the sum?

SELECT
   a1.*,
       SUM(a1.VAL1) OVER (PARTITION BY a1.KEY1 ORDER  BY a1.Date_ ) AS CUMU_VAL2
FROM source_table a1

Gordon Linoff · Answer 1 · 2018-09-25T00:44:47.243

1

You can assign a group -- which is the sum of 0s after a given row. Then use count():

select t.KEY1, t.Date_, t.VAL1,
       count(*) over (partition by key1, grp, (case when val1 = 0 then 0 else 1 end)
                      order by date_
                     ) as cume_val1
from (select t.*,
             sum(case when a.val1 = 0 then 1 else 0 end) over (partition by key1 order by date_ rows between 1 following and unbounded following) as grp
      from source_table t
     ) t;

If val1 only takes on the values 0 and 1, then use row_number() instead of count().

edited Sep 25 '18 at 00:44

answered Sep 24 '18 at 21:38

Gordon Linoff

1,242,037
58
646
786

@MatBailie . . . Good catch. Thank you. – Gordon Linoff Sep 24 '18 at 21:55
When you are doing case when a.val1 = 0 then 0, is it taking the value from previous row? The cumulative value becomes 0 only if the Val1 from previous row is 0 – yabtzey Sep 24 '18 at 21:57
I think the `case when a.val1 = 0 then 0` is incorrect. In the op's example `cum_val2` is always incremented by 1 *(`keep on adding the value by 1`)* including for the `0` row. I think a simple `ROW_NUMBER() OVER (PARTITION BY key, grp ORDER BY Date_) - 1` would be correct. – MatBailie Sep 24 '18 at 21:57
@MatBailie - the sum should reset if lag of a1.val1 by 1 day = 0, i am not sure the best way to put this in the query. – yabtzey Sep 24 '18 at 22:03
@yabtzey . . . I see now. The definition of the group was off. – Gordon Linoff Sep 25 '18 at 00:45

score 1 · Accepted Answer · answered Sep 24 '18 at 22:08

1

My amendment to @GordonLinoff's answer as the OP didn't quite understand what I meant.

SELECT
  t.KEY1, t.Date_, t.VAL1,
  ROW_NUMBER() OVER (PARTITION BY key1, grp
                         ORDER BY Date_
                    )
                    - 1
                         AS CUMU_VAL2
FROM
(
  SELECT
    *,
    SUM(
      CASE WHEN val1 = 0 THEN 1 ELSE 0 END
    )
    OVER (
      PARTITION BY key1
          ORDER BY date_
    )
      AS grp
  FROM
    source_table
)
  t;

answered Sep 24 '18 at 22:08

MatBailie

83,401
18
103
137

I did a -2 instead of -1, so that the count of consecutive fields displays the way I want. This causes the cum_val2 to show '-1' instead of '0' if the preceding val2 field is 0, which is fine! Thank you! – yabtzey Sep 25 '18 at 14:18

SQL to calculate cumulative sum that resets based on previous value in a column in Hive

2 Answers2