Given an event stream for each key, I would like to maintain some internal state, and emit a state history for each event. A naive implementation would simply chunk the data by key, iterate over the events in order, maintain some internal state in a structure, and emit a row every time the state changes.

The challenge is that I would like to do this in dask, where iterating over rows is not performant. In that case, is the solution as simple as df.groupby(key).apply(state_machine_func), where state_machine_func sorts its group by event_date and iterates over the resulting data frame? I'm not sure whether this actually works.

Example data:

df.head()

Out[1]:
      key   event   event_date
0     1     A       2019-01-01
1     1     B       2019-02-01
2     2     A       2019-01-15
3     2     B       2019-04-15
4     2     F       2019-07-01
5     3     K       2019-01-02
6     3     R       2019-02-01
7     3     Z       2019-02-02