
I have a Flink job with parallelism set to 6 and a few simple transformations. The issue is that when Flink has been running for more than 12 hours, for example, the load on the machine starts to increase. I first thought it was because of the traffic into Flink during certain hours of the day, but when the traffic goes down, the load on the machine stays somewhat elevated: lower than before, but still higher than normal.

Use cases:

DataStream<Event> from_source = rabbitConsumer
                .flatMap(new RabbitMQConsumer())
                .assignTimestampsAndWatermarks(new PeriodicExtractor());
SingleOutputStreamOperator<Event> data_stream = from_source
                    .filter(new NullidsFilterFunction());

KeyedStream<Event, String> keyed_stream = data_stream.keyBy(k -> k.id);

/*one stateful operator*/
data_stream.map(new EventCount(x))
            .keyBy(k -> new Date(k.timestamp.getTime()).toString())
            .window(TumblingEventTimeWindows.of(Time.minutes(30)))
            .process(new MyProcessWindowFunction())
            .addSink(new SinkFuncion());

/*two stateful operator*/
keyed_stream.window(TumblingEventTimeWindows.of(Time.minutes(10)))
            .process(new MyProcessWindowFunction())
            .addSink(new SinkFuncion());

/*three*/
keyed_stream.filter(new FilterFunction())
            .map(new AvailabilityCheckClass())
            .addSink(new SinkFuncion());

/*four*/
product_view_keyed_stream = data_stream
            .filter(new FilterFunction())
            .map(new EventProdView(x))
            .keyBy(k -> k.id+ new Date(k.timestamp.getTime()));
product_view_keyed_stream.addSink(new SinkFuncion());

/*five stateful operator*/
product_view_keyed_stream.window(TumblingEventTimeWindows.of(Time.minutes(30)))
          .process(new MyProcessWindowFunction())
          .addSink(new SinkFuncion());

/*Six stateful operator with 4 ConcurrentHashMap into the state*/
keyed_stream.flatMap(new FlatMapFunction())
            .addSink(new SinkFuncion());

/*seven stateful operator*/
keyed_stream.window(TumblingEventTimeWindows.of(Time.minutes(10)))
            .process(new MyProcessWindowFunction())
            .addSink(new SinkFuncion());

/*eight stateful operator*/
data_stream.filter(new FilterFunction())
           .keyBy(k -> k.rType.equals("ClickIdent") ? k.avidat : k.email_sha2)
           .flatMap(new FlatMapFunction())
           .addSink(new SinkFuncion());

My question: what could be the cause of the high CPU usage when my Flink job has been running for more than 6 hours, for example?

Insights: heap memory looks fine (no OOM), checkpoints are all completed, no events are lost, and JVM CPU consumption looks fine too. The CMS GC young-generation counter always increases (this worries me, even though it should be normal because it is a counter, but it increases too fast). This job is running as a simple Java application (local execution, not on a cluster with a Flink installation, just java -jar flink.jar; I don't know if this has anything to do with it, just sharing the information).

One-hour example (screenshot)

Thanks a lot!

Alter
  • Garbage collector running? Memory leak? – Thorbjørn Ravn Andersen Jul 31 '20 at 17:31
  • Based on the heap memory I don't think it's a memory leak; in fact, right now it starts to decrease as it should, but I'm afraid I still need to find the cause. I'm not sure whether it's the GC, but what do you think about G1? It is running with CMS at the moment. – Alter Jul 31 '20 at 17:45
  • 1
    I would attach visualvm when yiu see problems. – Thorbjørn Ravn Andersen Jul 31 '20 at 17:58
  • Do you see FullGC in the GC logging? G1GC, just like CMS, is designed to decrease latency issues during garbage collection for large heaps, but isn't necessary for batch jobs. FullGC is a fallback mechanism for when CMS fails – JohannesB Jul 31 '20 at 18:00
  • I'm not running a batch job but a streaming job; have a look at this log: 2020-07-31T15:47:55.639+0000: 31.173: [GC (Allocation Failure) Desired survivor size 357564416 bytes, new threshold 7 (max 15) [PSYoungGen: 2742784K->146228K(3548160K)] 2770612K->174065K(9488384K), 0.1212284 secs] [Times: user=0.36 sys=0.08, real=0.13 secs] – Alter Jul 31 '20 at 18:08
  • 1
    Are any of the maps, flatmaps, or filters using flink's key-partitioned state? If so, with which state backend? How does the key space grow over time? Are you using state TTL? – David Anderson Jul 31 '20 at 20:20
  • I'm not familiar with Flink but the data you provide only says you have a lot of busy processes but it doesn't say what they are doing or even whether it has anything to do with your Flink code. In order to find that out you might want to try CPU profiling with VisualVM. Your graph also doesn't tell us how much data is processed during the increasing load. – JohannesB Aug 01 '20 at 16:23
  • David, I have a general keyed stream by user id, which is used in the whole application; besides that, I have more keyed streams that I need for specific use cases. All the transformations in the code above marked with “stateful operator” are operators using state, such as process or flatMap; I’m using only ValueState and MapState. I’m using FsStateBackend. State grows to around 500MB in total after more than 1 day running. Hope this information helps. Thanks – Alter Aug 02 '20 at 13:16
  • Yes, I’m using TTL; almost all the use cases are configured with 1 day of TTL, and there are just two of them with 3 days. What do you recommend? Thanks a lot. – Alter Aug 02 '20 at 15:20
  • @JohannesB every time I've tried to start VisualVM from my IntelliJ I've got this error: Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit. I've tried a few methods that I found, but none of them seems to work; any idea? Kind regards and thanks a lot for answering. – Alter Aug 02 '20 at 15:32
  • @DavidAnderson my checkpoint size right now, after a day running, is 603MB (and it will continue increasing), and I just realized that some checkpoint durations are above 3 or 4 minutes, even though I have configured the timeout to be 2 minutes, which I have no idea how is happening; all checkpoints are completed, but the checkpoint duration starts increasing with the CPU load. Just giving more information. – Alter Aug 02 '20 at 15:38
  • @Alejandro Deulofeu I'm not sure why you would launch VisualVM from IntelliJ, but maybe you can find out what is wrong from a log file, or from the command that is executed, by checking strace -f for IntelliJ – JohannesB Aug 02 '20 at 15:54
  • @JohannesB all options to start VisualVM were tried and nothing worked; I’ve got the same error every time. Do you have any recommendations? Kind regards. – Alter Aug 03 '20 at 11:57
  • Please ask a separate question for your issues with VisualVM if they are not answered by this documentation: https://visualvm.github.io/troubleshooting.html – JohannesB Aug 03 '20 at 18:02
  • Could you enable the garbage collector log and attach it? – rcastellcastell Aug 13 '20 at 08:43
  • I have tracked down the application with VisualVM, JProfiler, MemoryAnalyzer and other tools, and I was able to find a lot of bugs in the code due to unnecessary object creation, all of which were fixed. Now I don't have a constantly high load, but something like hills with low load in between, each lasting about 1h 45m from when the hill starts until it ends. Any idea? Thanks – Alter Aug 13 '20 at 13:45
  • ``` [73157.515s][info][gc] GC(2269) Pause Young (G1 Evacuation Pause) 9756M->9193M(12288M) (73157.415s, 73157.515s) 99.906ms [73188.940s][info][gc] GC(2270) Pause Young (G1 Evacuation Pause) 9757M->9260M(12288M) (73188.690s, 73188.940s) 249.712ms [73199.514s][info][gc] GC(2271) Pause Young (G1 Evacuation Pause) 9828M->9325M(12288M) (73199.172s, 73199.514s) 342.215ms [73205.523s][info][gc] GC(2272) Pause Young (G1 Evacuation Pause) 9901M->9338M(12288M) (73205.452s, 73205.523s) 71.129ms ``` – Alter Aug 13 '20 at 14:05

1 Answer


Since you are using a heap-based state backend (the FSStateBackend keeps its working state on the JVM heap), and the state TTL is configured to 1 (or 3) days, it's to be expected that the state size will grow. How much it will grow is very application specific; it depends on how your key space grows over time.
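
Just for reference, per-state TTL on a heap-based backend is typically configured along these lines; the state name and the exact settings below are placeholders rather than anything taken from your job:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

// Sketch: a 1-day TTL on a hypothetical ValueState used by one of the stateful operators.
StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.days(1))
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)             // TTL refreshed on every write
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired) // expired values are never returned
        .build();

ValueStateDescriptor<Long> countDescriptor =
        new ValueStateDescriptor<>("event-count", Long.class);                 // hypothetical state name
countDescriptor.enableTimeToLive(ttlConfig);
```

Keep in mind that with the FsStateBackend, entries that have expired but have not yet been cleaned up still count toward the state size on the heap.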

Can you put the 603MB checkpoint size in some context? I.e., how much state is that for each distinct key? It sounds like you are surprised, but it's not clear why.

There are many reasons why checkpoints can become slow, but generally this is an indication of either backpressure or some sort of resource contention in reaching the remote filesystem -- e.g., S3 rate limits. If you look at the checkpointing statistics in the Flink WebUI, you can look for clues there. Look to see whether the checkpoint barriers are taking a long time to traverse the execution graph, or whether it is taking a long time for the asynchronous part of the checkpointing to write the checkpoint to the remote disks. And look for asymmetries -- is one instance taking much longer, and handling much more state, than the others?

If you are doing any blocking I/O in a user function, that can cause trouble. Or you may have significant data skew (e.g., a hot key). Or a slow network between the task manager and the distributed filesystem. Or the cluster may be under-provisioned -- you may need to increase the parallelism.
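
If a blocking call (a database or HTTP lookup, say) does turn out to be hiding in one of those functions, Flink's async I/O operator is the usual way to keep it from stalling the task thread. A rough sketch, where AsyncEnrichFunction and externalLookup are hypothetical stand-ins for whatever the job actually calls:

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncEnrichFunction extends RichAsyncFunction<Event, Event> {
    private transient ExecutorService executor;

    @Override
    public void open(Configuration parameters) {
        executor = Executors.newFixedThreadPool(4);
    }

    @Override
    public void asyncInvoke(Event input, ResultFuture<Event> resultFuture) {
        // Run the blocking call on a separate pool so the task thread is never blocked.
        CompletableFuture
                .supplyAsync(() -> externalLookup(input), executor)
                .thenAccept(result -> resultFuture.complete(Collections.singleton(result)));
    }

    @Override
    public void close() {
        executor.shutdown();
    }

    private Event externalLookup(Event input) {
        return input; // placeholder for the real blocking lookup
    }
}
```

It would be wired in with a timeout and a cap on in-flight requests:

```java
DataStream<Event> enriched = AsyncDataStream.unorderedWait(
        data_stream, new AsyncEnrichFunction(), 5, TimeUnit.SECONDS, 100);
```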

You may need to increase the checkpoint timeout. If at some point the checkpoint duration becomes truly problematic, you could switch to using the RocksDB state backend, in order to be able to use incremental checkpointing (but whether this will help depends on what's going on). Or you could change the state TTL configuration to purge state more quickly.
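
If you do end up trying RocksDB, the switch is mostly configuration. A minimal sketch, assuming the flink-statebackend-rocksdb dependency is on the classpath and using a placeholder checkpoint URI and timeout:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JobSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Give slow checkpoints more headroom while investigating (10 minutes is just an example).
        env.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000L);

        // RocksDB keeps working state off-heap and, with the second argument set to true,
        // writes incremental checkpoints instead of full snapshots each time.
        env.setStateBackend(new RocksDBStateBackend("file:///path/to/checkpoints", true));

        // ... build the same pipeline as before and call env.execute() ...
    }
}
```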

David Anderson
  • Thank you so much for your comment, David. I will have a look at everything you’ve mentioned here; these look like very good places to start. One final question: where do you recommend starting to apply the changes you suggested here? Kind regards. – Alter Aug 03 '20 at 02:31
  • Hi @David, after 4h 22m running the job, I have 205 MB in state and 206 MB as the maximum checkpoint size at the moment; all checkpoints are completed and the maximum duration is 9s, which I think is fine. I have a 4-core CPU and increased the parallelism to 8, increased the checkpoint timeout to 1h, and the load started giving high spikes for some time and then going down again, which at that point should be normal, but at some point it stops going down and only goes up – Alter Aug 03 '20 at 20:55
  • The CPU load from ongoing state cleanup (part of state TTL) with the FsStateBackend is likely to increase until it reaches some plateau, but it depends on how you have configured state cleanup. See https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/state.html#cleanup-of-expired-state for details. – David Anderson Aug 04 '20 at 08:29
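
Following up on that last comment about cleanup of expired state: the cleanup strategies are also configured on the StateTtlConfig, and for a heap-based backend incremental cleanup is the relevant one. A sketch, with example numbers only:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.days(1))
        // Heap backends clean up lazily: check up to 10 state entries per state access,
        // and also on every processed record (second argument), trading some CPU for memory.
        .cleanupIncrementally(10, true)
        .build();
```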