Our Scala application (Kubernetes deployment) constantly experience Akka Cluster heartbeat delays of ≈3s.
Once we even had a 200s delay which also manifested itself in the following graph:
Can someone suggest things to investigate further?
Specs
- Kubernetes 1.12.5
requests.cpu = 16
# limits.cpu not set
- Scala 2.12.7
- Java 11.0.4+11
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:+AlwaysPreTouch
-Xlog:gc*,safepoint,gc+ergo*=trace,gc+age=trace:file=/data/gc.log:time,level,tags:filecount=4,filesize=256M
-XX:+PerfDisableSharedMem
- Akka Cluster 2.5.25
Java Flight Recording
Some example:
timestamp delay_ms
06:24:55.743 2693
06:30:01.424 3390
07:31:07.495 2487
07:36:12.775 3758
There were 4 suspicious time points where lots of Java Thread Park events were registered simultaneously for Akka threads (actors & remoting) and all of them correlate to heartbeat issues:
Around 07:05:39
there were no "heartbeat was delayed" logs, but was this one:
07:05:39,673 WARN PhiAccrualFailureDetector heartbeat interval is growing too large for address SOME_IP: 3664 millis
No correlation with halt events or blocked threads were found during Java Flight Recording session, only two Safepoint Begin events in proximity to delays:
CFS throttling
The application CPU usage is low, so we thought it could be related to
how K8s schedule our application node for CPU.
But turning off CPU limits haven't improved things much,
though kubernetes.cpu.cfs.throttled.second
metric disappeared.
Separate dispatcher
Using a separate dispatcher seems to be unnecessary since delays happen even when there is no load, we also built an explicit application similar to our own which does nothing but heartbeats and it still experience these delays.
K8s cluster
From our observations it happens way more frequently on a couple of K8s nodes in a large K8s cluster shared with many other apps when our application doesn't loaded much.
A separate dedicated K8s cluster where our app is load tested almost have no issues with heartbeat delays.