We are running a job with a ListState of between 300 GB and 400 GB, and the list can sometimes grow to a few thousand items. In our use case every item must have its own TTL, so we register a new Timer for every new item of this ListState, using a RocksDB backend on S3.
This currently amounts to 140+ million timers (each triggering at event.timestamp + 40 days).
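For context, here is a minimal sketch of that pattern, assuming a KeyedProcessFunction keyed by a String field; the names (Item, PerItemTtlFunction, the "items" state descriptor) are illustrative, not the ones in our real MonitoringProcessFunction:

import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

import scala.collection.JavaConverters._

// Hypothetical event type; the real job's type is not shown here.
case class Item(name: String, timestamp: Long)

class PerItemTtlFunction extends KeyedProcessFunction[String, Item, Item] {

  // 40 days in milliseconds, matching the event.timestamp + 40 days rule.
  private val ttlMillis: Long = 40L * 24 * 60 * 60 * 1000

  @transient private var items: ListState[Item] = _

  override def open(parameters: Configuration): Unit = {
    items = getRuntimeContext.getListState(
      new ListStateDescriptor[Item]("items", classOf[Item]))
  }

  override def processElement(
      value: Item,
      ctx: KeyedProcessFunction[String, Item, Item]#Context,
      out: Collector[Item]): Unit = {
    items.add(value)
    // One event-time timer per element, firing 40 days after the event's timestamp.
    ctx.timerService().registerEventTimeTimer(value.timestamp + ttlMillis)
  }

  override def onTimer(
      timestamp: Long,
      ctx: KeyedProcessFunction[String, Item, Item]#OnTimerContext,
      out: Collector[Item]): Unit = {
    // Drop the expired elements and rewrite the list with the survivors.
    val remaining = items.get().asScala.filter(_.timestamp + ttlMillis > timestamp)
    items.update(remaining.toList.asJava)
  }
}

The real function also emits to a side output (see the topology below), but the per-element timer registration is the part relevant to the problem.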
Our problem is that the job's checkpointing suddenly gets stuck, or becomes VERY slow (around 1% progress in a few hours), until it eventually times out. It generally stalls on a piece of code that is pretty simple (the Flink dashboard shows 0/12 (0%) for it while the previous lines show 12/12 (100%)):
[...]
// Kafka source, rebalanced and passed through a counting map.
val myStream = env
  .addSource(someKafkaConsumer)
  .rebalance
  .map(new CounterMapFunction[ControlGroup]("source.kafkaconsumer"))
  .uid("src_kafka_stream")
  .name("some_name")

// Monitoring function; its side output is keyed by name and written to the sink.
myStream
  .process(new MonitoringProcessFunction())
  .uid("monitoring_uuid")
  .name(monitoring_name)
  .getSideOutput(outputTag)
  .keyBy(_.name)
  .addSink(sink)
[...]
A few more pieces of information:
- AT_LEAST_ONCE checkpointing mode seems to get stuck more easily than EXACTLY_ONCE.
- A few months ago the state grew to 1.5 TB of data and, I think, billions of timers, without any issue.
- RAM, CPU and networking look normal on the machines running both task managers.
- state.backend.rocksdb.thread.num = 4 (see the configuration sketch after this list).
- The most recent incident happened right when we received a flood of events (millions within minutes), but the previous one did not coincide with such a flood.
- All of the events come from Kafka topics.
- When in AT_LEAST_ONCE checkpointing mode, the job still runs and consumes normally while checkpoints are stuck.
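For reference, the checkpointing and state backend setup looks roughly like the sketch below; the S3 path, checkpoint interval and timeout are placeholders rather than our exact production values, and state.backend.rocksdb.thread.num = 4 is set in flink-conf.yaml:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object CheckpointConfigSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Incremental RocksDB checkpoints stored on S3 (bucket name is made up).
    env.setStateBackend(new RocksDBStateBackend("s3://some-bucket/checkpoints", true))

    // Checkpointing mode is the one we switch between; interval/timeout are illustrative.
    env.enableCheckpointing(10 * 60 * 1000L, CheckpointingMode.AT_LEAST_ONCE)
    env.getCheckpointConfig.setCheckpointTimeout(30 * 60 * 1000L)
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(60 * 1000L)

    // ... build the topology here and call env.execute(...).
  }
}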
This is the second time this has happened to us: the topology runs fine with a few million events per day and then suddenly stops checkpointing. We have no idea what could be causing this.
Can anyone think of what could suddenly cause the checkpointing to get stuck?