
I would like to do a window aggregation with early-trigger logic (think of the aggregation as being triggered either when the window closes or by a specific event), and I read in the docs: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#incremental-window-aggregation-with-aggregatefunction

The doc mentions: "Note that using ProcessWindowFunction for simple aggregates such as count is quite inefficient." So the suggestion is to pair it with incremental window aggregation.

My question is: for the `AverageAggregate` in the doc, the state is not saved anywhere, so if the application crashes, the `AverageAggregate` will lose all its intermediate values, right?

So if that is the case, is there a way to do a window aggregation that still supports incremental aggregation and has a state backend to recover from a crash?

1 Answer

The `AggregateFunction` indeed only describes the mechanism for combining input events into some result; that specific class does not store any data.
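To make that contract concrete, here is a minimal sketch of the accumulator logic that the doc's `AverageAggregate` follows (`createAccumulator`/`add`/`merge`/`getResult`), stripped of the Flink types so it runs standalone. The class and method layout are illustrative, not the doc's exact code:

```java
// Sketch of the AggregateFunction contract without Flink dependencies.
// Accumulator is a (sum, count) pair held in a long[2]; names are illustrative.
class AverageSketch {
    // Called once per window/key to create an empty accumulator.
    static long[] createAccumulator() {
        return new long[] {0L, 0L};
    }

    // Called per element: fold one value into the accumulator.
    static long[] add(long value, long[] acc) {
        return new long[] {acc[0] + value, acc[1] + 1L};
    }

    // Combine two partial accumulators (needed e.g. when session windows merge).
    static long[] merge(long[] a, long[] b) {
        return new long[] {a[0] + b[0], a[1] + b[1]};
    }

    // Called when the window fires: turn the accumulator into the final result.
    static double getResult(long[] acc) {
        return (double) acc[0] / acc[1];
    }

    public static void main(String[] args) {
        long[] acc = createAccumulator();
        for (long v : new long[] {1, 2, 3, 6}) {
            acc = add(v, acc);
        }
        System.out.println(getResult(acc)); // prints 3.0
    }
}
```

The important point is that the class itself is stateless: Flink hands the accumulator in and out on every call, which is exactly what lets Flink keep that accumulator in managed, checkpointed state.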

The state is nevertheless persisted for us by Flink behind the scenes when we write something like this:

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .aggregate(new AverageAggregate(), new MyProcessWindowFunction());

The `.keyBy(<key selector>).window(<window assigner>)` part tells Flink to hold a piece of state for us for each key and time bucket, and to call our code in `AverageAggregate()` and `MyProcessWindowFunction()` when relevant.

In case of a crash or restart, no data is lost (assuming the state backend and checkpointing are configured properly): as with other parts of Flink state, the state here will either be retrieved from the state backend or recomputed from first principles from upstream data.
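Note the caveat: checkpointing is not enabled by default. As a rough sketch of what "configured properly" could look like in `flink-conf.yaml` for Flink 1.12 (the backend choice and checkpoint directory below are placeholder values, adjust for your setup):

```
# Illustrative values only
state.backend: rocksdb                           # or "filesystem"
state.checkpoints.dir: hdfs:///flink/checkpoints # placeholder path
execution.checkpointing.interval: 60s
```

The same settings can also be applied programmatically on the `StreamExecutionEnvironment`.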

– Svend
  • Thank you so much for your answer. My initial understanding was that only state I create explicitly, such as `state = getRuntimeContext().getState(valueStateDescriptor);`, would be checkpointed and stored, but apparently I was wrong. – user2289345 Feb 19 '21 at 17:46
  • Thanks for the positive feedback. What's confusing is that there are many layers in Flink. At the lowest level of the DataStream API, we indeed have to declare state explicitly, in the way you describe, in classes like `RichMapFunction` or `KeyedProcessFunction`. On top of that, the `window` operations provide an abstraction layer that hides those details and manages the state for us. And on top of that again, the Table API and SQL layer make it all look declarative and hide many execution details, although behind the scenes there's once more a bunch of state being managed for us. – Svend Feb 20 '21 at 13:52
  • Thank you :) Yeah, I am more used to the low-level DataStream API; coming from other streaming systems, the higher-level, all-the-way-to-declarative layers are sometimes very confusing to me. – user2289345 Feb 28 '21 at 06:35