
We are trying to set up a stateful Flink job using the RocksDB backend. We are using session windows with a 30-minute gap and an AggregateFunction, so we are not using any Flink state variables ourselves. With sampling, we have less than 20k events/s and 20-30 new sessions/s. Our session basically gathers all of the events, so the size of the session accumulator grows over time. We are using 10G of memory in total, with Flink 1.9 and 128 containers. These are the settings:

state.backend: rocksdb
state.checkpoints.dir: hdfs://nameservice0/myjob/path
state.backend.rocksdb.memory.managed: true
state.backend.incremental: true
state.backend.rocksdb.memory.write-buffer-ratio: 0.4
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.1

containerized.heap-cutoff-ratio: 0.45
taskmanager.network.memory.fraction: 0.5
taskmanager.network.memory.min: 512mb
taskmanager.network.memory.max: 2560mb
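
Roughly, the job looks like this (a simplified sketch; Event, SessionAccumulator, SessionResult, the Kafka source, and the sink are stand-ins for our actual classes):

DataStream<Event> events = env
        .addSource(kafkaSource)                  // ~20k events/s after sampling
        .filter(e -> isRelevant(e));             // simple filter

events.keyBy(e -> e.getSessionId())
      .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
      .aggregate(new AggregateFunction<Event, SessionAccumulator, SessionResult>() {
          public SessionAccumulator createAccumulator() { return new SessionAccumulator(); }
          public SessionAccumulator add(Event e, SessionAccumulator acc) {
              acc.addEvent(e);                   // we keep every event, so the accumulator grows with the session
              return acc;
          }
          public SessionResult getResult(SessionAccumulator acc) { return acc.toResult(); }
          public SessionAccumulator merge(SessionAccumulator a, SessionAccumulator b) { return a.mergeWith(b); }
      })
      .addSink(sink);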

From our monitoring at a given point in time, the RocksDB memtable size is less than 10 MB and our heap usage is less than 1 GB, but our direct memory usage (network buffers) is 2.5 GB. The buffer pool / buffer usage metrics are all at 1 (full). Our checkpoints keep failing; is it normal for the network buffers to use up this much memory?

I'd really appreciate it if you could give some suggestions :) Thank you!

lucky_start_izumi

1 Answer


For what it's worth, session windows do use Flink state internally. (So do most sources and sinks.) Depending on how you are gathering the session events into the session accumulator, this could be a performance problem. If you need to gather all of the events together, why are you doing this with an AggregateFunction, rather than having Flink do this for you?

For the best windowing performance, you want to use a ReduceFunction or an AggregateFunction that incrementally reduces/aggregates the window, keeping only a small bit of state that will ultimately be the result of the window. If, on the other hand, you use only a ProcessWindowFunction without pre-aggregation, then Flink will internally use an appending list state object that, when used with RocksDB, is very efficient -- it only has to serialize each event to append it to the end of the list. When the window is ultimately triggered, the list is delivered to you as an Iterable that is deserialized in chunks. By contrast, if you roll your own solution with an AggregateFunction, you may have RocksDB deserializing and reserializing the accumulator on every access/update. This can become very expensive, and may explain why the checkpoints are failing.
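
To make the contrast concrete, here is a sketch of the two approaches (Event, SessionSummary, and SessionResult are hypothetical types, not taken from the question):

// (a) Incremental aggregation: the per-window state is just the small accumulator.
events.keyBy(e -> e.getSessionId())
      .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
      .aggregate(new AggregateFunction<Event, SessionSummary, SessionSummary>() {
          public SessionSummary createAccumulator() { return new SessionSummary(); }
          public SessionSummary add(Event e, SessionSummary s) { s.update(e); return s; }   // stays small
          public SessionSummary getResult(SessionSummary s) { return s; }
          public SessionSummary merge(SessionSummary a, SessionSummary b) { return a.merge(b); }
      });

// (b) If every event really is needed, let Flink buffer them in list state;
//     with RocksDB each incoming event is only serialized and appended.
events.keyBy(e -> e.getSessionId())
      .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
      .process(new ProcessWindowFunction<Event, SessionResult, String, TimeWindow>() {
          @Override
          public void process(String sessionId, Context ctx,
                              Iterable<Event> allEvents, Collector<SessionResult> out) {
              out.collect(SessionResult.from(allEvents));   // the Iterable is deserialized lazily, in chunks
          }
      });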

Another interesting fact you've shared is that the buffer pool / buffer usage metrics show that they are fully utilized. This is an indication of significant backpressure, which in turn would explain why the checkpoints are failing. Checkpointing relies on the checkpoint barriers being able to traverse the entire execution graph, checkpointing each operator as they go, and completing a full sweep of the job before timing out. With backpressure, this can fail.
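
For reference, the interval and timeout that this sweep has to fit within live on the CheckpointConfig; a minimal sketch with placeholder values (note that raising the timeout only hides backpressure, it doesn't fix it):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000L);                              // trigger a checkpoint every 60 seconds
env.getCheckpointConfig().setCheckpointTimeout(10 * 60_000L);  // the barrier sweep must complete within 10 minutes
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000L);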

The most common cause of backpressure is under-provisioning -- or in other words, overwhelming the cluster. The network buffer pools become fully utilized because the operators can't keep up. The answer is not to increase buffering, but to remove/fix the bottleneck.

David Anderson
  • Thank you so much David, I switched to ProcessWindowFunction, and the checkpoints are all successful now :) But the network buffer usage is still pretty high (better than before); I'll update the graph in the question itself. I checked our job, and there's no backpressure shown on the Flink UI though. For the RocksDB usage, the metrics show much less memory used than the state sizes I see on the Flink UI. Could you please advise? – lucky_start_izumi Oct 14 '20 at 16:59
  • With Flink 1.9 the backpressure detection isn't very robust. The buffer pool and inputQueueLength metrics tell a more complete story -- and it looks like you have backpressure, but it's not continuous. What is your #1 concern at this point? And have you tried increasing the parallelism? – David Anderson Oct 14 '20 at 17:14
  • My concern is that, with the network buffer usage already pretty full, will the job be able to cope when the traffic grows, or when there are spikes? Right now we are testing with about 1/15 of the traffic. We have 128 containers, each now with 20G of memory in total, and the CPU usage is very low, usually less than 1%. We'll probably try with 256 containers (since our Kafka source has 256 partitions), but it feels like a waste of CPU resources, and more machines talking to each other may introduce a network bottleneck. – lucky_start_izumi Oct 14 '20 at 17:20
  • Ideally you should provision the job so that it can handle normal loads with only occasional backpressure, so that you have the capacity to handle typical load spikes without falling behind (assuming you are targeting a "keeping up with real-time" scenario). (BTW, low CPU can be misleading -- with data skew you might have one hot key overwhelming one core while the remaining cores are doing nothing.) – David Anderson Oct 14 '20 at 17:37
  • There are lots of things that can cause backpressure, so increasing the parallelism might not be the right answer. Or in other words, there are many ways to abuse Flink and create performance problems. If you want to share the job topology, that might permit some more concrete suggestions. But common mistakes include using keyBy too often, inefficient serializers, preventing operator chaining, not enabling object reuse, ... – David Anderson Oct 14 '20 at 17:37
  • To clarify, I wouldn't be too concerned by those graphs. Only if you have constant backpressure do you need to take action. If latency and checkpoint times are ok, then you are probably fine. Flink doesn't use that many network buffers, so you should expect they are sometimes fully utilized. – David Anderson Oct 14 '20 at 17:47
  • Thank you David. Our job's very simple: a filter and then a keyBy (session id) with a 30-minute-gap session window, using ProcessWindowFunction. I'll update the topology in the question itself. Our CPU usage's pretty even across all containers. I feel it's hard to tell whether the job's robust or not at this point. Do you have any tuning guidance on this? Thank you so much! – lucky_start_izumi Oct 14 '20 at 18:52
  • It's hard to provide good advice via SO; I can mostly only answer straightforward questions. But one thing we sometimes do is to implement a parameterized data generator that approximates the salient characteristics of production data, and then use that to see how the pipeline reacts under different kinds of load. See https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html for some suggestions on how to check whether your job is healthy. – David Anderson Oct 14 '20 at 19:26
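
For illustration, such a parameterized generator can be as simple as a SourceFunction whose event rate and session count are constructor arguments (a sketch only; Event is a hypothetical type, and the distributions would need to match your production data):

import org.apache.flink.streaming.api.functions.source.SourceFunction;
import java.util.Random;

public class SessionLoadGenerator implements SourceFunction<Event> {
    private final int eventsPerSecond;      // target throughput per source subtask
    private final int activeSessions;       // how many distinct session ids to emit
    private volatile boolean running = true;

    public SessionLoadGenerator(int eventsPerSecond, int activeSessions) {
        this.eventsPerSecond = eventsPerSecond;
        this.activeSessions = activeSessions;
    }

    @Override
    public void run(SourceContext<Event> ctx) throws Exception {
        Random rnd = new Random();
        while (running) {
            long batchStart = System.currentTimeMillis();
            for (int i = 0; i < eventsPerSecond && running; i++) {
                String sessionId = "session-" + rnd.nextInt(activeSessions);
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(new Event(sessionId, System.currentTimeMillis()));
                }
            }
            long elapsed = System.currentTimeMillis() - batchStart;
            if (elapsed < 1000) {
                Thread.sleep(1000 - elapsed);    // crude rate limiting to roughly eventsPerSecond
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}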