Why is the Apache Flink checkpoint size so large?

Question

I have a simple Apache Flink job:
**DataSource (Apache Kafka) - Filter - KeyBy - CEP Pattern (with timer) - PatternProcessFunction - KeyedProcessFunction (*here I have a ValueState(Boolean) and register a 5-minute timer. If the ValueState is not null, I update the ValueState (nothing is sent to the collector) and update the timer. If the ValueState is null, I save TRUE in the state, send the input event to the collector, and set the timer. When the onTimer method fires, I clear my ValueState*) - Sink (Apache Kafka)**.
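The KeyedProcessFunction logic described above can be sketched as a plain-Java model. This is only an illustration under assumptions: it does not use the Flink API, the names (`DedupWithTimerModel`, `processElement`, `advanceTo`, `liveKeys`) are hypothetical, two maps stand in for the keyed ValueState and the timer service, and "update the timer" is assumed to mean replacing the previous timer for that key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the per-key logic; NOT the Flink API (hypothetical names).
class DedupWithTimerModel {
    // 5-minute timer, as in the job description.
    static final long TIMER_MS = 5 * 60 * 1000L;

    // Stands in for the keyed ValueState<Boolean>; a missing key means "state is null".
    private final Map<String, Boolean> seen = new HashMap<>();
    // Stands in for the timer service: one pending timer per key (key -> firing time).
    private final Map<String, Long> timers = new HashMap<>();
    // Stands in for the Collector: events forwarded towards the sink.
    final List<String> emitted = new ArrayList<>();

    void processElement(String key, String event, long nowMs) {
        if (seen.get(key) == null) {
            // First event for this key: remember it and forward the event.
            seen.put(key, Boolean.TRUE);
            emitted.add(event);
        } else {
            // Key already seen: update the state, forward nothing.
            seen.put(key, Boolean.TRUE);
        }
        // Assumption: "update timer" replaces the previous timer for this key.
        timers.put(key, nowMs + TIMER_MS);
    }

    // Fires every timer that is due at nowMs and clears the state of its key.
    void advanceTo(long nowMs) {
        timers.entrySet().removeIf(entry -> {
            if (entry.getValue() <= nowMs) {
                seen.remove(entry.getKey());
                return true;
            }
            return false;
        });
    }

    // Number of keys that currently hold state.
    int liveKeys() {
        return seen.size();
    }
}
```

In this model, when every message carries a new key, every event is forwarded and each key keeps one state entry and one timer until its 5 minutes have passed.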
Job settings:
**Checkpointing interval: 5000ms**
**Incremental checkpointing: true**
**Semantic: Exactly Once**
**State Backend: RocksDB**
**Parallelism: 4**
Logically my job works perfectly, but I have some problems.
I ran two tests on my cluster (2 JobManagers and 3 TaskManagers):
**First test:**
I started my job and connected it to an empty Apache Kafka topic. Then I saw in the Flink Web UI **Checkpointing Statistics:**
1) Latest Acknowledgement - Trigger Time = 5000 ms (like my checkpoint interval)
2) State size = 340 KB at each 5-second interval
3) All statuses were completed (blue).
**Second test:**
I started sending JSON messages with different keys (from "1" to Integer.MAX_VALUE) to the Apache Kafka topic. The sending rate was 1000 messages/sec. Then I saw in the Flink Web UI **Checkpointing Statistics:**
1) Latest Acknowledgement - Trigger Time = 1 - 6 minutes
**My question #1: Why is this time growing? Is it bad or OK?**
2) The state size was constantly growing. I sent messages to Kafka for about 10 minutes (1000 x 60 x 10 = 600000 messages). After sending, the state size was 100 MB - 150 MB.
3) After sending, I waited about an hour and saw that:
Latest Acknowledgement - Trigger Time = 5000 ms (like my checkpoint interval)
State size was 100 MB - 150 MB at each 5-second interval.
**My question #2: Why doesn't it decrease? I checked my job logs and saw 600000 records saying that the ValueState for the key was cleared (the onTimer method ran successfully), and the job logic (see the description of my KeyedProcessFunction) was working correctly.**
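The figures from the second test can be put side by side with a small calculation. This is a rough sketch, not a measurement: the class and method names are hypothetical, 1 MB is assumed to be 1,000,000 bytes, every message is assumed to carry a distinct key, and dividing the reported state size by the number of messages gives only a crude ratio, not what is actually stored per key.

```java
// Back-of-the-envelope arithmetic on the numbers reported in the second test.
// Hypothetical helper class; assumes 1 MB = 1,000,000 bytes and all-distinct keys.
class CheckpointSizeEstimate {

    // Total messages sent at a constant rate for a number of minutes.
    static long messagesSent(long perSecond, long minutes) {
        return perSecond * 60 * minutes;
    }

    // Crude ratio: reported state bytes divided by the number of keys (integer division).
    static long bytesPerKey(long stateBytes, long keys) {
        return stateBytes / keys;
    }

    // Keys whose timer has not fired yet while messages arrive at a steady rate.
    static long keysLiveAtOnce(long perSecond, long timerSeconds) {
        return perSecond * timerSeconds;
    }

    public static void main(String[] args) {
        long sent = messagesSent(1000, 10);
        System.out.println("messages sent: " + sent);
        System.out.println("bytes per key at 100 MB: " + bytesPerKey(100_000_000L, sent));
        System.out.println("bytes per key at 150 MB: " + bytesPerKey(150_000_000L, sent));
        System.out.println("keys live during sending: " + keysLiveAtOnce(1000, 5 * 60));
    }
}
```

Under these assumptions the reported size corresponds to roughly 166 to 250 bytes per message sent, and at 1000 messages/sec with a 5-minute timer about 300000 keys hold state at any moment while messages are arriving.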
What have I tried?
1) Setting a pause between checkpoints
2) Disabling incremental checkpoints
3) Enabling async checkpoints (in flink-conf.yaml)
None of this changed anything!
**My question #3: What should I do? On the production server the rate is *10 million messages/hour* and the checkpoint size increases instantly.**

Yes, David. I am using the Flink CEP library. But I always use a pattern with .within(Time.seconds(....)) - Viacheslav Ryabchikov, Dec 30 '20 at 07:46

