
I am trying to create a column that increments on every state change. The increment should happen whether or not the state has been seen before.

| epoch | state  | state_idx |
|-------|--------|-----------|
| 1     | open   | 1         |
| 2     | open   | 1         |
| 3     | closed | 2         |
| 4     | closed | 2         |
| 5     | open   | 3         |
| 6     | open   | 3         |
| 7     | open   | 3         |

I want state_idx so that I can group the data by it; once grouped, the data will be faster to process on a Spark cluster. A minimal sketch of the input is below.
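
For reference, the first two columns of the table above can be built as a small DataFrame like this (a minimal PySpark sketch; the name `df` is arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Input data: epoch and state; state_idx is the column I want to derive.
df = spark.createDataFrame(
    [(1, "open"), (2, "open"), (3, "closed"), (4, "closed"),
     (5, "open"), (6, "open"), (7, "open")],
    ["epoch", "state"],
)
```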

Toby
  • Use a lag to check if the `state` column changes, and then do a cumulative sum. Look at [my answer](https://stackoverflow.com/a/51311714/5858851) to [this question](https://stackoverflow.com/questions/51309693/grouping-consecutive-rows-in-pyspark-dataframe) for an approach you can follow. – pault Oct 03 '18 at 18:54

1 Answer


This is a gaps-and-islands problem. For this version, I think lag() and a cumulative sum are the best solution:

```sql
select t.*,
       -- running count of state changes = island number
       sum(case when prev_state = state then 0 else 1 end) over (order by epoch) as state_idx
from (select t.*,
             lag(state) over (order by epoch) as prev_state
      from t
     ) t;
```

The lag() gets the previous state. The case expression assigns a flag of 1 when the state changed and 0 when it did not; on the first row prev_state is NULL, so the comparison is not true and the flag is 1. The sum() over then keeps a running count of those changes, which is exactly state_idx.
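
If you want the same logic directly on a Spark DataFrame rather than through SQL, here is a minimal sketch using the DataFrame window API (not part of the original answer; the DataFrame and column names follow the question):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table in the question.
df = spark.createDataFrame(
    [(1, "open"), (2, "open"), (3, "closed"), (4, "closed"),
     (5, "open"), (6, "open"), (7, "open")],
    ["epoch", "state"],
)

w = Window.orderBy("epoch")

# Flag each row whose state differs from the previous row (lag() is
# NULL on the first row, so it is flagged too), then running-sum the flags.
result = df.withColumn(
    "state_idx",
    F.sum(
        F.when(F.lag("state").over(w) == F.col("state"), 0).otherwise(1)
    ).over(w),
)

result.show()
```

Note that a window with no partitionBy pulls all rows into a single partition, which is fine for an example this size but worth keeping in mind at scale.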

Gordon Linoff