Flink yarn-session mode becomes unstable when running ~10 batch jobs at the same time

Question

I am trying to set up a Flink YARN session to run 100+ batch jobs. Once the session had connected to ~40 task managers (each with 2 slots and 1 GB of memory) and ~10 jobs were running, it became unstable, even though enough resources were available. The Flink UI suddenly became unavailable; I suspect the job manager had already died. Eventually the YARN application was killed as well.

The job manager runs on a 4-core, 16 GB node with 12 GB available.

Is there a guide for working out the job manager resources needed for a given number of task managers?
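
For reference, the session footprint described above can be tallied with a few lines of Python. This is only an arithmetic sketch, not a sizing rule for the job manager: the class name SessionShape and its fields are made up for illustration, and it assumes the 1 GB figure is per task manager rather than per slot.

```python
from dataclasses import dataclass


@dataclass
class SessionShape:
    """Hypothetical helper holding the numbers quoted in the question."""
    task_managers: int
    slots_per_tm: int
    memory_gb_per_tm: float  # assumption: 1 GB per task manager, not per slot

    def total_slots(self) -> int:
        # Every task manager contributes the same number of slots.
        return self.task_managers * self.slots_per_tm

    def total_memory_gb(self) -> float:
        # Memory held by all task manager containers together.
        return self.task_managers * self.memory_gb_per_tm

    def slots_per_job(self, running_jobs: int) -> float:
        # Average share of slots if they were split evenly across jobs.
        return self.total_slots() / running_jobs


shape = SessionShape(task_managers=40, slots_per_tm=2, memory_gb_per_tm=1.0)
print(shape.total_slots())      # slots the job manager has to track
print(shape.total_memory_gb())  # GB across all task managers
print(shape.slots_per_job(10))  # average slots per job with ~10 jobs running
```

With the figures from the question this gives 80 slots and 40 GB across the task managers, or about 8 slots per job on average.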

Recommend you ask about this on the flink user mailing list. That's a better forum for tapping into the practical experience of the community. — David Anderson, Aug 16 '20 at 06:26
Which Flink version are you running? Could you share the cluster logs with us? — Till Rohrmann, Aug 19 '20 at 08:19

score 1 · Answer 1 · answered Aug 20 '20 at 00:39

I got this fixed. The Flink session was breaking because of the low network bandwidth of the worker machines in the cluster. A worker machine that runs a task manager container should have at least 750 Mbps. With each task manager having 2 slots and 1 GB of memory, a moderate bandwidth of ~450 Mbps won't cut it. If the job is IO-intensive, communication between actors (job manager to worker, or worker to worker) can time out (the default ask timeout, akka.ask.timeout, is 10 s).

I decided not to increase the ask timeout, so that the jobs would not run longer because of this bottleneck.
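
A quick way to apply that rule of thumb is to compare each worker's measured bandwidth against the 750 Mbps figure. The snippet below is a minimal sketch: the host names and measurements are invented for illustration, the function name is hypothetical, and the threshold is just the number I observed on my cluster, not an official Flink requirement.

```python
# Threshold taken from the observation above; not an official Flink value.
MIN_MBPS = 750.0


def workers_below_threshold(bandwidth_mbps, minimum_mbps=MIN_MBPS):
    """Return the hosts whose measured bandwidth is under the minimum, sorted by name."""
    return sorted(
        host for host, mbps in bandwidth_mbps.items() if mbps < minimum_mbps
    )


# Hypothetical measurements in Mbps, e.g. collected with a network benchmark tool.
measured = {"worker-a": 940.0, "worker-b": 450.0, "worker-c": 750.0}

slow = workers_below_threshold(measured)
print(slow)  # hosts that should not run task manager containers
```

Here only worker-b (450 Mbps) is flagged; a host at exactly 750 Mbps passes because the check is for strictly less than the minimum.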
