
I have a Flink application with a parallelism of 48 (1 JobManager, 3 TaskManagers) and almost 2300-2400 tasks.

But sometimes Flink can't consume Kafka records quickly enough, and this causes latency.

In the graphs there is no backpressure in any task (I got the results from the Prometheus integration, `flink_taskmanager_job_task_isBackPressured`).

I am mainly using RocksDB to store state; only 5-6 streams (with parallelism 48) are using `registerProcessingTimeTimer()`.
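To make that pattern concrete, here is a minimal sketch of how keyed state plus a processing-time timer is typically wired up (the class name, the `String` element type, and the 10-minute cleanup interval are illustrative placeholders, not the actual job code):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Illustrative only: keyed state (backed by RocksDB) plus a processing-time cleanup timer.
public class TimerSketch extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Long> lastSeen;

    @Override
    public void open(Configuration parameters) {
        lastSeen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastSeen", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        long now = ctx.timerService().currentProcessingTime();
        lastSeen.update(now);
        // Schedule a cleanup timer 10 minutes later (processing time).
        ctx.timerService().registerProcessingTimeTimer(now + 10 * 60 * 1000L);
        out.collect(value);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // Clear the state when the timer fires.
        lastSeen.clear();
    }
}
```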

There are no checkpoint & savepoint operations

What could be the problem? (Or should I add a new node to the cluster?)

The object deserialized from each Kafka record contains 24 primitive fields, 1 complex object (with 15 primitive fields), 1 `Map<String,String>`, and 1 other complex object (with 8 primitive fields).
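Roughly, the record shape looks like the sketch below (all class and field names are hypothetical; only the field counts above come from the question):

```java
import java.util.Map;

// Hypothetical shape of the deserialized Kafka record (names are illustrative).
public class KafkaRecordSketch {

    public static class Inner15 { /* ~15 primitive fields */ public long a; public int b; }
    public static class Inner8  { /* ~8 primitive fields  */ public long c; public int d; }

    // ~24 primitive fields at the top level, for example:
    public long id;
    public int count;
    public double amount;
    // ... and so on ...

    public Inner15 details;                // complex object with 15 primitive fields
    public Map<String, String> attributes; // Map<String,String>
    public Inner8 extra;                   // complex object with 8 primitive fields
}
```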

monstereo
  • Have you tried the hashmap state backend to compare performance? In my case, we also had backpressure caused by RocksDB, so you could probably try to narrow down the problem. – Mikalai Lushchytski Aug 10 '21 at 13:21
  • @MikalaiLushchytski, I have not tried the hashmap state backend. I have tons of state in my application, and I believe holding all of that state in memory is not recommended? Also, how did you know that the problem was RocksDB? Maybe I should increase the number of threads for the property `state.backend.rocksdb.checkpoint.transfer.thread.num`? – monstereo Aug 10 '21 at 14:13
  • https://stackoverflow.com/a/63956701/2000823 has some ideas for you to consider. If your goal is to improve throughput (and average latency) then the hashmap state backend will definitely help (provided, of course, that you can give it enough memory). – David Anderson Aug 10 '21 at 17:58
  • If you are on Flink 1.13, `backPressuredTimeMsPerSecond` will give you a much more accurate measure of the extent to which backpressure is affecting your job. – David Anderson Aug 10 '21 at 18:02
  • Hi @DavidAnderson, to analyze further I just enabled `metrics.latency.interval: 10000` and `state.backend.latency-track.keyed-state-enabled: true` in the Flink configuration, but I could not find the metric names for these. What are the metric names for these configurations? (I am using Prometheus with Grafana integration.) – monstereo Aug 20 '21 at 12:23
  • The reason you're not finding the latency tracking metrics is because they are job metrics, and not task or operator metrics. See https://stackoverflow.com/a/50982071/2000823. – David Anderson Aug 20 '21 at 13:15
  • Now I get results like `latency.source_id.<454d...>.operator_id.<996...>.operator_subtask_index.37.latency_p75`. First, what is the p75? Also, I could not match the source_id and operator_id to my operators' names (there are no logs containing source_id 454d.. or operator_id 996 in either the TaskManagers or the JobManager). At the very least, I should be able to identify which source_id and operator_id correspond to which operator name in the dashboard at http://localhost:8081/#/job/123adfs./overview – monstereo Aug 20 '21 at 14:24
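Following up on the state-backend comparison suggested in the comments, here is a minimal sketch of running the same job on the heap-based backend for a throughput comparison (this assumes Flink 1.13+ where `HashMapStateBackend` is available; the class name and placeholder pipeline are illustrative):

```java
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HashMapBackendComparison {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Heap-based state backend: avoids RocksDB's per-access (de)serialization,
        // but all state must fit in the TaskManagers' memory.
        env.setStateBackend(new HashMapStateBackend());

        // Placeholder pipeline; in practice, run the real job unchanged and compare
        // consumer lag / throughput against the RocksDB run.
        env.fromElements(1, 2, 3).print();
        env.execute("hashmap-backend-comparison");
    }
}
```

If the heap-backed run keeps up with the Kafka topic while the RocksDB run does not, that points at state-access cost rather than the Kafka source itself.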

0 Answers