For Apache Flink aggregations, is it better to have a single aggregation with complex state, or several smaller aggregations spread across more tasks?
For example, suppose I have a data stream of users watching videos over a web interface. I want aggregations for:
- How many videos a user watches
- How many different IP addresses a user watches videos from
- How many different login sessions a user watches videos from
- etc. (about 10 different facets in total)
In terms of Flink resources, is it better to create one aggregation object per user and collect stats on each of the facets (keeping track of the distinct values internally), or is it better to create a separate keyed stream for each key combination?
inputStream
    .keyBy("accountId")
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .aggregate(new UberAggregator());
Where the UberAggregator function keeps track of all the distinct values for the different facets.
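To make the first option concrete, here is a rough sketch of the kind of state I have in mind for UberAggregator. It only mirrors the four methods of Flink's AggregateFunction interface (createAccumulator, add, getResult, merge) in plain Java so that it stays dependency-free; it does not actually implement the Flink interface. The WatchEvent and UserStats types, their field names, and the choice to count every watch event as one video are my own assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical event type; the field names are assumptions taken from the keyBy calls.
final class WatchEvent {
    final String accountId;
    final String videoId;
    final String ipAddress;
    final String sessionId;

    WatchEvent(String accountId, String videoId, String ipAddress, String sessionId) {
        this.accountId = accountId;
        this.videoId = videoId;
        this.ipAddress = ipAddress;
        this.sessionId = sessionId;
    }
}

// Hypothetical accumulator: one object per user and window holding every facet.
final class UserStats {
    long videosWatched;                               // counts watch events
    final Set<String> ipAddresses = new HashSet<>();  // distinct IP addresses
    final Set<String> sessionIds = new HashSet<>();   // distinct login sessions
}

// Same method shape as Flink's AggregateFunction, without the Flink dependency.
final class UberAggregatorSketch {
    UserStats createAccumulator() {
        return new UserStats();
    }

    UserStats add(WatchEvent event, UserStats acc) {
        acc.videosWatched++;
        acc.ipAddresses.add(event.ipAddress);
        acc.sessionIds.add(event.sessionId);
        return acc;
    }

    // Combines two partial accumulators for the same user.
    UserStats merge(UserStats a, UserStats b) {
        a.videosWatched += b.videosWatched;
        a.ipAddresses.addAll(b.ipAddresses);
        a.sessionIds.addAll(b.sessionIds);
        return a;
    }

    // Result layout: {videos watched, distinct IPs, distinct sessions}.
    long[] getResult(UserStats acc) {
        return new long[] {acc.videosWatched, acc.ipAddresses.size(), acc.sessionIds.size()};
    }
}
```

The point of the sketch is that the accumulator grows with the number of distinct values per user, which is the "complex state" I am asking about.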
OR
inputStream
    .keyBy("accountId", "ipAddress")
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .aggregate(new SumAggregator());
inputStream
    .keyBy("accountId", "videoId")
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .aggregate(new SumAggregator());
inputStream
    .keyBy("accountId", "sessionId")
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .aggregate(new SumAggregator());
...
Where the SumAggregator is a simple aggregation function that keeps track of one thing.
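For comparison, this is roughly what I mean by SumAggregator in the second option: a counter whose accumulator is a single Long. Again it is plain Java that only mirrors the method shape of Flink's AggregateFunction rather than implementing it, and the class name and generic parameter are assumptions for illustration.

```java
// Same method shape as Flink's AggregateFunction, without the Flink dependency.
// T is whatever event type flows through the keyed stream.
final class SumAggregatorSketch<T> {
    Long createAccumulator() {
        return 0L;
    }

    // Every event for the current key and window adds one to the count.
    Long add(T event, Long acc) {
        return acc + 1;
    }

    // Combines two partial counts for the same key.
    Long merge(Long a, Long b) {
        return a + b;
    }

    Long getResult(Long acc) {
        return acc;
    }
}
```

Here the state per key is tiny, but there is one accumulator per key combination (for example per accountId and ipAddress pair) instead of one per user.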