
I'm trying to calculate moving average for the last 24h of events. I'm struggling to understand how to implement the "moving" part of the calculation with Apache Beam.

My scenario is as follows:

  1. Given an unbounded stream of events, where each event has user and value fields.
  2. Group the events by user to get per-user streams.
  3. Calculate the average per-user value over the last 24h. If I do it by hand:
  • Take current time
  • Find events where the event time is >= now - 24h
  • For these events average the value field to get a single number.
  4. Sink the calculated per-user average value to a database table.
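To make the "by hand" calculation in step 3 concrete, here is a minimal sketch in plain Python (not Beam). The function name `average_last_24h` and the event dictionaries with `user`, `value` and `time` keys are assumptions made for illustration only.

```python
from datetime import datetime, timedelta, timezone


def average_last_24h(events, now, horizon=timedelta(hours=24)):
    """Average `value` per user over events with event time >= now - horizon."""
    cutoff = now - horizon
    totals = {}  # user -> (sum of values, number of events)
    for event in events:
        if event["time"] >= cutoff:
            total, count = totals.get(event["user"], (0.0, 0))
            totals[event["user"]] = (total + event["value"], count + 1)
    return {user: total / count for user, (total, count) in totals.items()}


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
events = [
    {"user": "alice", "value": 10.0, "time": now - timedelta(hours=30)},  # already expired
    {"user": "alice", "value": 20.0, "time": now - timedelta(hours=5)},
    {"user": "alice", "value": 40.0, "time": now - timedelta(hours=1)},
    {"user": "bob", "value": 7.0, "time": now - timedelta(hours=23)},
]

print(average_last_24h(events, now))
# Two hours later bob's only event has expired, so bob drops out of the result.
print(average_last_24h(events, now + timedelta(hours=2)))
```

Calling the function again with a later `now` is the "moving" part: the result changes even though no new events arrived.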

The moving aspect here is that when an event expires (the clock ticks forward and its event time becomes < now - 24h), the average value should be recalculated.

What I tried:

  • It isn't a FixedWindow because it's moving. I don't want to know the daily average value for a given day. I want to know the current average value within the last 24h.
  • Since the user generates events randomly, the documentation suggests this could be a use case for SessionWindow, but I'm not interested in understanding user behaviour or finding sessions. I'm not sure SessionWindow fits, because it's started by event time and I'm looking for fixed clock-time boundaries.

Can someone please explain how this moving average should be implemented in Beam?

I'm new to Apache Beam and have just started. I have looked at the Windowing and Trigger documentation and reviewed the leader board example. So far I haven't found an example for calculating moving values.

Karol Dowbecki

1 Answer


This depends on how often you want the outputs.

If you want to have a 24h moving average every hour, you can use sliding windows, e.g. with a duration of 24 hours and a period of 1 hour. Every hour you will get a grouping of all the events over the last 24 hours. When combined with a CombinePerKey operation (e.g. using MeanCombineFn), this will aggregate elements as they come in and emit the mean every hour (per user).
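To show what such a sliding window computes, here is a minimal sketch in plain Python. It does not use the Beam API; it only mimics how a 24-hour window with a 1-hour period places each event into 24 overlapping windows and then averages per user and window. The function names and the `(user, value, timestamp_seconds)` tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

HOUR = 3600  # seconds


def sliding_windows(timestamp, size=24 * HOUR, period=HOUR):
    """Return the [start, end) windows, in seconds, that contain `timestamp`."""
    last_start = timestamp - timestamp % period
    return [(start, start + size)
            for start in range(last_start, last_start - size, -period)]


def mean_per_user_per_window(events, size=24 * HOUR, period=HOUR):
    """events: iterable of (user, value, timestamp_seconds) tuples."""
    acc = defaultdict(lambda: [0.0, 0])  # (user, window) -> [sum, count]
    for user, value, ts in events:
        for win in sliding_windows(ts, size, period):
            acc[(user, win)][0] += value
            acc[(user, win)][1] += 1
    return {key: total / count for key, (total, count) in acc.items()}


events = [
    ("alice", 10.0, 0),
    ("alice", 30.0, 12 * HOUR),
    ("alice", 50.0, 30 * HOUR),
]
means = mean_per_user_per_window(events)

print(means[("alice", (0, 24 * HOUR))])          # events at 0h and 12h
print(means[("alice", (7 * HOUR, 31 * HOUR))])   # events at 12h and 30h
print(means[("alice", (13 * HOUR, 37 * HOUR))])  # only the event at 30h
```

Each window's mean is what would be emitted once that window closes, so a new value per user becomes available every `period`, but never more often than that.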

Sliding windows can become a bit unwieldy if there are too many of them (e.g. with a period of one minute over a 24-hour timespan, each element belongs to 1,440 overlapping windows). In that case you can compute the average manually with state and timers: store all events in state and, when a new element comes in and/or a timer fires, compute and emit the average (cleaning up old state so it does not grow forever).
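The state-and-timers idea can be sketched in plain Python as follows. This is not Beam code: the `MovingAverage` class and its `on_event` and `on_timer` methods are hypothetical stand-ins for per-key state and timer callbacks, and the sketch assumes events for a user arrive in timestamp order.

```python
from collections import deque

DAY = 24 * 3600  # seconds


class MovingAverage:
    """Rolling mean for one user; a stand-in for per-key state."""

    def __init__(self, horizon=DAY):
        self.horizon = horizon
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0       # running sum of the values still in `events`

    def _expire(self, now):
        # Clean up old state: drop events with timestamp < now - horizon.
        while self.events and self.events[0][0] < now - self.horizon:
            _, value = self.events.popleft()
            self.total -= value

    def on_event(self, timestamp, value):
        """Called when a new element arrives; returns the updated mean."""
        self.events.append((timestamp, value))
        self.total += value
        return self.on_timer(timestamp)

    def on_timer(self, now):
        """Called when a timer fires; returns the current mean, or None if empty."""
        self._expire(now)
        return self.total / len(self.events) if self.events else None


avg = MovingAverage()
print(avg.on_event(0, 10.0))           # one event so far
print(avg.on_event(12 * 3600, 30.0))   # two events within 24h
print(avg.on_timer(DAY + 1))           # the first event has expired
print(avg.on_timer(37 * 3600))         # everything has expired
```

Keeping a running sum means each update costs time proportional to the number of expired events rather than the whole 24-hour history. With floating-point values the sum can drift slightly over time, so a real implementation might recompute it from the stored events periodically.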

robertwb