
I am running a simple Flink aggregation job that consumes from Kafka and applies multiple sliding windows (1 hr, 2 hr, ... up to 24 hours), each with a specific slide interval, and aggregates over those windows. Sometimes the job restarts and we lose data, because the windows start over from the latest Kafka data. To overcome this we enabled checkpointing, and I can see the checkpoint size growing (config: HashMapStateBackend with HDFS storage). What are the best ways to checkpoint a forever-running Flink job, and can we control the size of the checkpoints, since they will be huge after a few days of running?

Tried enabling checkpointing with HashMapStateBackend and HDFS storage.
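
For context, a minimal sketch of the kind of checkpoint configuration described above (Flink 1.13+ API; the checkpoint interval and HDFS path are placeholders, not the actual values from this job):

```java
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 s (interval is an assumption; tune for your latency/recovery needs).
        env.enableCheckpointing(60_000);

        // HashMapStateBackend keeps all state on the JVM heap and snapshots the FULL state
        // on every checkpoint, so checkpoint size tracks total window state.
        env.setStateBackend(new HashMapStateBackend());

        // Checkpoint storage location in HDFS (path is a placeholder).
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");

        // ... build the Kafka source, windows, and sink here, then:
        // env.execute("windowed-aggregation");
    }
}
```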

1 Answer


The Flink window code should clean up state after a window has expired. Note that this assumes your workflow is running in event-time mode and supplying appropriate watermarks. Also, if you configure a "max lateness" (allowed lateness), then the actual wall-clock time when the window state is removed depends on both the watermark timestamps and that max lateness.
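
To make that concrete, here is a minimal, self-contained sketch of an event-time sliding window with watermarks and allowed lateness (the sample elements, window size, slide, out-of-orderness bound, and lateness are illustrative assumptions, not this job's actual values):

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the Kafka source: (key, event-time millis) pairs.
        DataStream<Tuple2<String, Long>> events = env.fromElements(
                Tuple2.of("deviceA", 0L),
                Tuple2.of("deviceA", 60_000L),
                Tuple2.of("deviceB", 120_000L));

        // Event-time mode: extract timestamps and emit watermarks that tolerate 5 min of out-of-orderness.
        DataStream<Tuple2<String, Long>> withTimestamps = events.assignTimestampsAndWatermarks(
                WatermarkStrategy
                        .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                        .withTimestampAssigner((record, previousTimestamp) -> record.f1));

        withTimestamps
                .keyBy(record -> record.f0)
                .window(SlidingEventTimeWindows.of(Time.hours(1), Time.minutes(5)))
                // Window state is purged once the watermark passes window end + allowed lateness.
                .allowedLateness(Time.minutes(10))
                .sum(1)
                .print();

        env.execute("event-time-window-sketch");
    }
}
```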

Separately, you'll have window state for each sliding window x each unique key. So if you have, say, a 24-hour window with a 1-minute slide, then you'll have (1,440 x # of unique keys) windows, which can cause an explosion in the size of your state.
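
A back-of-the-envelope sketch of the state blow-up described above (the key cardinality is an assumption; substitute your own):

```java
import java.time.Duration;

public class SlidingWindowStateEstimate {
    public static void main(String[] args) {
        // A sliding window keeps one window state per (key, window) pair:
        // number of concurrent windows per key = window size / slide.
        Duration windowSize = Duration.ofHours(24);
        Duration slide = Duration.ofMinutes(1);
        long windowsPerKey = windowSize.toMinutes() / slide.toMinutes();   // 1,440

        long uniqueKeys = 100_000;   // hypothetical key cardinality
        long liveWindowStates = windowsPerKey * uniqueKeys;                // 144,000,000

        System.out.printf("windows per key: %d, live window states: %d%n",
                windowsPerKey, liveWindowStates);
    }
}
```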

kkrugler
  • We are using ingestion time, so we don't have late arrivals or watermarks. We have the source, window, and sink as the stateful operators. It is currently running with the RocksDB state backend with incremental checkpointing. Did you mean it will create one operator state per key? It has been running for the last 4 days, and the space taken in HDFS by the persisted checkpoint data is not growing daily: yesterday it was 82 GB and today it is 72 GB. Wondering, is it expiring the checkpointed window data itself? – Pritam Agarwala Nov 07 '22 at 10:19
  • If you aren't using watermarks, how are your windows firing? Or are you implementing windows yourself via custom operators? – kkrugler Nov 08 '22 at 03:19
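
For reference, a minimal sketch of the RocksDB-with-incremental-checkpointing setup mentioned in the first comment (the interval and HDFS path are placeholders):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Incremental checkpoints upload only the RocksDB files that changed since the
        // previous checkpoint, so steady-state checkpoint size stays roughly proportional
        // to live state rather than growing with job runtime.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Placeholder checkpoint location and interval.
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");
        env.enableCheckpointing(60_000);
    }
}
```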