
For Apache Flink aggregations, is it better to have one aggregation with complex state, or several smaller aggregations that need more tasks?

For example, say I have a data stream of users watching videos over a web interface. I want aggregations for:

  • How many videos a user watches
  • How many different IP addresses a user watches videos from
  • How many different login sessions a user watches videos from
  • etc. (about 10 different facets)

Is it better for Flink's resource usage to create a single aggregation object per user that collects stats on each facet (keeping track of the different values internally), or is it better to create multiple streams, one per key combination?

inputStream
  .keyBy("accountId")
  .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
  .aggregate(new UberAggregator());

Where the UberAggregator function keeps track of all the different values for the different facets.
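A minimal sketch of what such an UberAggregator could look like. The ViewEvent and UserStats types, their fields, and the getter names are hypothetical placeholders, not from the question:

import java.util.HashSet;
import java.util.Set;
import org.apache.flink.api.common.functions.AggregateFunction;

// Hypothetical accumulator: one counter plus one set per distinct-value facet.
class UserStats {
    long videoCount;
    Set<String> ipAddresses = new HashSet<>();
    Set<String> sessionIds = new HashSet<>();
    // ... fields for the other facets
}

public class UberAggregator implements AggregateFunction<ViewEvent, UserStats, UserStats> {

    @Override
    public UserStats createAccumulator() {
        return new UserStats();
    }

    @Override
    public UserStats add(ViewEvent event, UserStats acc) {
        acc.videoCount++;                          // total videos watched
        acc.ipAddresses.add(event.getIpAddress()); // distinct IPs seen
        acc.sessionIds.add(event.getSessionId());  // distinct sessions seen
        // ... update the other facets
        return acc;
    }

    @Override
    public UserStats getResult(UserStats acc) {
        return acc;
    }

    @Override
    public UserStats merge(UserStats a, UserStats b) {
        a.videoCount += b.videoCount;
        a.ipAddresses.addAll(b.ipAddresses);
        a.sessionIds.addAll(b.sessionIds);
        return a;
    }
}

Note that the distinct-value sets live inside one accumulator per accountId, so the state for a hot account grows with the number of unique IPs and sessions it produces.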

OR

inputStream
  .keyBy("accountId", "ipAddress")
  .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
  .aggregate(new SumAggregator());

inputStream
  .keyBy("accountId", "videoId")
  .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
  .aggregate(new SumAggregator());

inputStream
  .keyBy("accountId", "sessionId")
  .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
  .aggregate(new SumAggregator());

...

Where the SumAggregator is a simple aggregation function that keeps track of a single count.
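A sketch of such a SumAggregator, again with a hypothetical ViewEvent type; it just counts the events seen for its compound key:

import org.apache.flink.api.common.functions.AggregateFunction;

// Counts events per (accountId, <facet>) key. Each window then emits one
// count per distinct facet value, per account, per minute.
public class SumAggregator implements AggregateFunction<ViewEvent, Long, Long> {
    @Override public Long createAccumulator()            { return 0L; }
    @Override public Long add(ViewEvent event, Long acc) { return acc + 1; }
    @Override public Long getResult(Long acc)            { return acc; }
    @Override public Long merge(Long a, Long b)          { return a + b; }
}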

Thor
  • What do you mean by "better"? Better performance, or easier development, or better resiliency to data skew, or something else? – kkrugler Apr 29 '23 at 20:50
  • "Better" is admittedly open-ended. Feedback on "better" in any capacity would be appreciated, though I am most interested in resource usage ("for Flink resources"). – Thor May 01 '23 at 14:13

3 Answers


It really depends on the scale of the data, and also on the distribution of keys in the streams.

For a small data set with a limited accountId range and a limited number of IP addresses/session IDs per account, it is totally fine to use the UberAggregator approach. Flink can manage all the data (unique IPs, session IDs) internally in state.

For a large data set with a larger accountId range but a limited number of IP addresses/session IDs per account (so no account has a very large set of unique IP addresses/session IDs), I'd still prefer the UberAggregator. To avoid the state getting too big, set a higher parallelism based on an estimate of the total number of accountIds; Flink will handle the scalability well.
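A minimal sketch of that configuration; the numbers are placeholders to be sized from the accountId estimate:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(32);      // spread the keyed state across more subtasks
env.setMaxParallelism(4096); // upper bound for rescaling (number of key groups)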

For a large data set with skewed accountIds, or where some account has a very large IP set, some operator instances will end up with large state (which causes slow tasks in streaming jobs); there I'd prefer the second solution. The three keyBys introduce extra data shuffling, but they can also mitigate the skew by adding extra fields to the key, as sketched below.
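As a sketch of that idea (numSalts and DistinctIpCounter are hypothetical): salt the key with a hash of the skewed field, so each subtask sees a disjoint slice of a hot account's IPs, then sum the per-salt distinct counts:

int numSalts = 16; // assumption: tune to the observed skew

// Stage 1: each (accountId, salt) key sees a disjoint subset of that
// account's IP addresses, so no single subtask holds the whole hot set.
DataStream<Tuple2<String, Long>> perSalt = inputStream
    .keyBy(e -> e.getAccountId() + "#"
                + Math.floorMod(e.getIpAddress().hashCode(), numSalts))
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .aggregate(new DistinctIpCounter()); // hypothetical: emits (accountId, distinctCount)

// Stage 2: because the salts partition the IP space, the partial
// distinct counts can simply be summed per account.
DataStream<Tuple2<String, Long>> perAccount = perSalt
    .keyBy(t -> t.f0)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .sum(1);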

BrightFlow
  • Thank you for the explanation. I think yours differs a little from the previous explanations, but I understand your point. This would be for a large data set with a relatively limited number of IP addresses/sessions, though it can get large in some cases (hundreds). – Thor May 09 '23 at 14:59

From a state management perspective, the UberAggregator is better. Let us assume that the Flink deployment uses the RocksDB state backend for state persistence (which is the common case).

In Flink, keyed operator state is partitioned and distributed across all the parallel operator instances. Each operator instance is assigned a range of key groups, where each key group is a subset of the key space.
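For reference, the assignment works roughly like this (simplified from Flink's KeyGroupRangeAssignment):

// Each key hashes into one of maxParallelism key groups; the key groups
// are then range-partitioned across the parallel operator instances.
int keyGroupId    = MathUtils.murmurHash(key.hashCode()) % maxParallelism;
int operatorIndex = keyGroupId * parallelism / maxParallelism;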

In the UberAggregator example, there is a single keyed operator. Its state is keyed by accountId; in RocksDB, each registered state (here, the aggregator's accumulator) lives in its own column family, with one entry per key.

If we assume that there are 1 million possible accountIds, the UberAggregator approach will result in 1 million keys with their values.

Compare that with using 10 different SumAggregator operators. Each operator will have its own 1 million keys (at minimum, since the compound keys multiply this further). That means at least 10 million keys maintained in the RocksDB backend.

This increased state size will have an impact on checkpointing, failed-task recovery, etc.

Hence, UberAggregator is better.

Shankar
  • Thank you for the explanation. There would be several million accounts in the case I am planning for. – Thor May 09 '23 at 15:01

A single UberAggregator would be most efficient. Network shuffles are expensive, so doing 10 keyBy()s isn't optimal, which is why having one keyBy() is going to be more efficient.

That's also the simplest workflow, and you want to start with the simplest solution that might work, then measure and optimize.

kkrugler