Cummulative sum by year_month, location, state in Hive SQL

Question

I want to make a cumulative count using Hive SQL in recorrencia column according to the other ones.

 +------------+---------+-------+--------------+--+
| t.ano_mes  | t.site  | t.uf  | recorrencia  |
+------------+---------+-------+--------------+--+
| 202001     | 174     | AM    | 1            |
| 202002     | 174     | AM    | 1            |
| 202003     | 174     | AM    | 1            |
| 202004     | 174     | AM    | 1            |
| 202005     | 174     | AM    | 1            |
| 202006     | 174     | AM    | 1            |
| 202007     | 174     | AM    | 1            |
| 202008     | 174     | AM    | 1            |
| 202005     | 1JN     | SP    | 1            |
| 202006     | 1JN     | SP    | 1            |
| 202005     | 1LJ     | SP    | 1            |
| 202009     | 1LJ     | SP    | 1            |
| 202001     | 1RG     | SP    | 1            |
| 202002     | 1RG     | SP    | 1            |
| 202003     | 1RG     | SP    | 1            |
| 202004     | 1RG     | SP    | 1            |
| 202005     | 1RG     | SP    | 1            |
| 202006     | 1RG     | SP    | 1            |
| 202007     | 1RG     | SP    | 1            |

Desired output

+------------+---------+-------+--------------+--------+
| t.ano_mes  | t.site  | t.uf  | recorrencia  |cum_rec
+------------+---------+-------+--------------+--------+
| 202001     | 174     | AM    | 1            |1
| 202002     | 174     | AM    | 1            |2
| 202003     | 174     | AM    | 1            |3
| 202004     | 174     | AM    | 1            |4
| 202005     | 174     | AM    | 1            |5
| 202006     | 174     | AM    | 1            |6
| 202007     | 174     | AM    | 1            |7
| 202008     | 174     | AM    | 1            |8
| 202005     | 1JN     | SP    | 1            |1
| 202006     | 1JN     | SP    | 1            |2
| 202005     | 1LJ     | SP    | 1            |1
| 202009     | 1LJ     | SP    | 1            |2
| 202001     | 1RG     | SP    | 1            |1
| 202002     | 1RG     | SP    | 1            |2
| 202003     | 1RG     | SP    | 1            |3
| 202004     | 1RG     | SP    | 1            |4
| 202005     | 1RG     | SP    | 1            |5
| 202006     | 1RG     | SP    | 1            |6
| 202007     | 1RG     | SP    | 1            |7

I've tried a lot of functions like COUNT(*) OVER (t.ano_mes) and COUNT(*) OVER (t.site) but it runs the sum until the end of table, and do not restarts as the t.site changes.

As soon as t.site changes, the counter should restart.

score 2 · Accepted Answer · answered Sep 24 '20 at 15:40

2

That would be:

sum(recorrencia) over(partition by t.site order by t.ano_mes) as cum_rec

The partition by clause causes the sum to reset every time the site changes.

Note that if recorrencia is always 1, as shown in your sample data, then row_number() is sufficient:

row_number() over(partition by t.site order by t.ano_mes) as cum_rec

answered Sep 24 '20 at 15:40

GMB

216,147
25
84
135

i think you mean to say 'Note that if recorrencia is `not` always 1, ... ' – Koushik Roy Sep 24 '20 at 16:03
1

Thanks. It did help me a lot! – Henrique Branco Sep 24 '20 at 16:57

Cummulative sum by year_month, location, state in Hive SQL

1 Answers1