I am running a simple Flink aggregation job that consumes from Kafka and applies multiple sliding windows (1 hr, 2 hr, ... up to 24 hours), each with a specific slide interval, and aggregates per window. Sometimes the job restarts and we lose the in-flight window data, because the job starts windowing again from the latest Kafka data. To overcome this we enabled checkpointing (HashMapStateBackend with HDFS storage), and I can see the checkpoint size growing steadily. What are the best ways to checkpoint a forever-running Flink job, and can we control the size of the checkpoints? After a few days of running they will be huge.
What I tried: enabling checkpointing with HashMapStateBackend and HDFS checkpoint storage, roughly as in the sketch below.
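A minimal sketch of the setup (Flink 1.13+ APIs; the broker address, topic, group id, HDFS path, checkpoint interval, and the placeholder key/aggregation are illustrative stand-ins, not my real job):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedAggregationJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Window state lives on the JVM heap; checkpoints snapshot it to HDFS.
        env.setStateBackend(new HashMapStateBackend());
        env.enableCheckpointing(60_000); // checkpoint every 60 s (illustrative)
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("events")
                .setGroupId("agg-job")
                // On restore, offsets come from the checkpoint;
                // this only sets the cold-start position.
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> events = env.fromSource(
                source, WatermarkStrategy.noWatermarks(), "kafka-source");

        // One of the several sliding windows (1 h .. 24 h):
        // here a 1 h window sliding every 5 minutes.
        events
                .keyBy(value -> value)
                .window(SlidingProcessingTimeWindows.of(Time.hours(1), Time.minutes(5)))
                .reduce((a, b) -> a) // placeholder aggregation
                .print();

        env.execute("windowed-aggregation");
    }
}
```

The real job uses several of these window/slide combinations on the same keyed stream, which is where the state (and hence the checkpoint size) accumulates.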