
I am trying to create a column that increments on every state change. The increment should happen whether or not the state has been seen before.

| epoch | state  | state_idx |
|-------|--------|-----------|
| 1     | open   | 1         |
| 2     | open   | 1         |
| 3     | closed | 2         |
| 4     | closed | 2         |
| 5     | open   | 3         |
| 6     | open   | 3         |
| 7     | open   | 3         |

I want state_idx so that I can group the data by it; once grouped, the data will be faster to process on a Spark cluster. A minimal sketch of the input is below.
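
For reference, the first two columns of the table above can be built as a small DataFrame like this (a minimal PySpark sketch; the name `df` is arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Input data: epoch and state; state_idx is the column I want to derive.
df = spark.createDataFrame(
    [(1, "open"), (2, "open"), (3, "closed"), (4, "closed"),
     (5, "open"), (6, "open"), (7, "open")],
    ["epoch", "state"],
)
```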

Toby
  • Use a lag to check if the `state` column changes, and then do a cumulative sum. Look at [my answer](https://stackoverflow.com/a/51311714/5858851) to [this question](https://stackoverflow.com/questions/51309693/grouping-consecutive-rows-in-pyspark-dataframe) for an approach you can follow. – pault Oct 03 '18 at 18:54

1 Answer


This is a gaps-and-islands problem. For this version, I think lag() and a cumulative sum are the best solution:

```sql
select t.*,
       -- running count of state changes = island number
       sum(case when prev_state = state then 0 else 1 end) over (order by epoch) as state_idx
from (select t.*,
             lag(state) over (order by epoch) as prev_state
      from t
     ) t;
```

The lag() gets the previous state. The case expression assigns a flag of 1 when the state changed and 0 when it did not; on the first row prev_state is NULL, so the comparison is not true and the flag is 1. The sum() over then keeps a running count of those changes, which is exactly state_idx.
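
If you want the same logic directly on a Spark DataFrame rather than through SQL, here is a minimal sketch using the DataFrame window API (not part of the original answer; the DataFrame and column names follow the question):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table in the question.
df = spark.createDataFrame(
    [(1, "open"), (2, "open"), (3, "closed"), (4, "closed"),
     (5, "open"), (6, "open"), (7, "open")],
    ["epoch", "state"],
)

w = Window.orderBy("epoch")

# Flag each row whose state differs from the previous row (lag() is
# NULL on the first row, so it is flagged too), then running-sum the flags.
result = df.withColumn(
    "state_idx",
    F.sum(
        F.when(F.lag("state").over(w) == F.col("state"), 0).otherwise(1)
    ).over(w),
)

result.show()
```

Note that a window with no partitionBy pulls all rows into a single partition, which is fine for an example this size but worth keeping in mind at scale.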

Gordon Linoff