Given an event stream for each key, I would like to maintain some internal state and emit the state history as the events are processed. A naive implementation would simply chunk the data by key, iterate over the events in date order, maintain the internal state in some structure, and emit a row every time the state changes.
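For concreteness, here is a minimal pandas sketch of that naive approach; the state structure and the transition rule (state = last event seen) are placeholders for my real logic:

```python
import pandas as pd

def state_history(group: pd.DataFrame) -> pd.DataFrame:
    """Replay one key's events in date order and emit a row
    every time the (placeholder) internal state changes."""
    state = None  # stand-in for the real state structure
    rows = []
    for _, row in group.sort_values("event_date").iterrows():
        new_state = row["event"]  # placeholder transition rule
        if new_state != state:
            state = new_state
            rows.append({"key": row["key"],
                         "event_date": row["event_date"],
                         "state": state})
    return pd.DataFrame(rows)

history = pd.concat(state_history(g) for _, g in df.groupby("key"))
```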
The challenge is that I would like to do this in dask, where iterating over rows is not performant. In that case, is the solution as simple as `df.groupby('key').apply(state_machine_func)`, where `state_machine_func` sorts each group by `event_date` and can just iterate over it? I'm not sure whether this actually works (a sketch of what I mean follows the example data).
Example data:
df.head()
Out[1]:
   key event  event_date
0    1     A  2019-01-01
1    1     B  2019-02-01
2    2     A  2019-01-15
3    2     B  2019-04-15
4    2     F  2019-07-01
5    3     K  2019-01-02
6    3     R  2019-02-01
7    3     Z  2019-02-02
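And this is roughly the dask version I have in mind; the `npartitions` value and the `meta` dtypes are my own guesses, and `state_machine_func` is just the placeholder replay from the pandas sketch above:

```python
import dask.dataframe as dd
import pandas as pd

# reuse the per-key replay from the pandas sketch above
state_machine_func = state_history

ddf = dd.from_pandas(df, npartitions=2)

# meta describes the schema of the frames the function returns;
# event_date is object here because the example dates are strings
meta = pd.DataFrame({"key": pd.Series(dtype="int64"),
                     "event_date": pd.Series(dtype="object"),
                     "state": pd.Series(dtype="object")})

history = ddf.groupby("key").apply(state_machine_func, meta=meta).compute()
```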