1

I want to make a cumulative count using Hive SQL in recorrencia column according to the other ones.

 +------------+---------+-------+--------------+--+
| t.ano_mes  | t.site  | t.uf  | recorrencia  |
+------------+---------+-------+--------------+--+
| 202001     | 174     | AM    | 1            |
| 202002     | 174     | AM    | 1            |
| 202003     | 174     | AM    | 1            |
| 202004     | 174     | AM    | 1            |
| 202005     | 174     | AM    | 1            |
| 202006     | 174     | AM    | 1            |
| 202007     | 174     | AM    | 1            |
| 202008     | 174     | AM    | 1            |
| 202005     | 1JN     | SP    | 1            |
| 202006     | 1JN     | SP    | 1            |
| 202005     | 1LJ     | SP    | 1            |
| 202009     | 1LJ     | SP    | 1            |
| 202001     | 1RG     | SP    | 1            |
| 202002     | 1RG     | SP    | 1            |
| 202003     | 1RG     | SP    | 1            |
| 202004     | 1RG     | SP    | 1            |
| 202005     | 1RG     | SP    | 1            |
| 202006     | 1RG     | SP    | 1            |
| 202007     | 1RG     | SP    | 1            |

Desired output

+------------+---------+-------+--------------+--------+
| t.ano_mes  | t.site  | t.uf  | recorrencia  |cum_rec
+------------+---------+-------+--------------+--------+
| 202001     | 174     | AM    | 1            |1
| 202002     | 174     | AM    | 1            |2
| 202003     | 174     | AM    | 1            |3
| 202004     | 174     | AM    | 1            |4
| 202005     | 174     | AM    | 1            |5
| 202006     | 174     | AM    | 1            |6
| 202007     | 174     | AM    | 1            |7
| 202008     | 174     | AM    | 1            |8
| 202005     | 1JN     | SP    | 1            |1
| 202006     | 1JN     | SP    | 1            |2
| 202005     | 1LJ     | SP    | 1            |1
| 202009     | 1LJ     | SP    | 1            |2
| 202001     | 1RG     | SP    | 1            |1
| 202002     | 1RG     | SP    | 1            |2
| 202003     | 1RG     | SP    | 1            |3
| 202004     | 1RG     | SP    | 1            |4
| 202005     | 1RG     | SP    | 1            |5
| 202006     | 1RG     | SP    | 1            |6
| 202007     | 1RG     | SP    | 1            |7

I've tried a lot of functions like COUNT(*) OVER (t.ano_mes) and COUNT(*) OVER (t.site) but it runs the sum until the end of table, and do not restarts as the t.site changes.

As soon as t.site changes, the counter should restart.

GMB
  • 216,147
  • 25
  • 84
  • 135
Henrique Branco
  • 1,778
  • 1
  • 13
  • 40

1 Answers1

2

That would be:

sum(recorrencia) over(partition by t.site order by t.ano_mes) as cum_rec

The partition by clause causes the sum to reset every time the site changes.

Note that if recorrencia is always 1, as shown in your sample data, then row_number() is sufficient:

row_number() over(partition by t.site order by t.ano_mes) as cum_rec
GMB
  • 216,147
  • 25
  • 84
  • 135