So whilst debugging some peaks in memory usage on our prod environment, I started taking thread dumps periodically and found the following entry:
"C2 CompilerThread0" #7 daemon prio=9 os_prio=0 cpu=208687.03ms elapsed=10057.72s tid=0x00007efed95b29d0 nid=0x2a runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Compiling: 41155 4 my.class::myFunction (384 bytes)
This entry is present in a lot of follow-up dumps (spanning up to 50 seconds), whilst, according to VM.native_memory summary, it is taking up memory in the range of 1G - 1.3G (in the various observed cases).
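For context, both of these observations come from plain jcmd against the running pod (NMT is enabled via -XX:NativeMemoryTracking=summary), roughly along these lines, with the PID as a placeholder:

jcmd <pid> Thread.print
jcmd <pid> VM.native_memory summary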
So, the thread dump before the compilation starts is
"C2 CompilerThread0" #7 daemon prio=9 os_prio=0 cpu=185150.71ms elapsed=10028.85s tid=0x00007efed95b29d0 nid=0x2a waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
No compile task
The last one before finishing
"C2 CompilerThread0" #7 daemon prio=9 os_prio=0 cpu=247100.38ms elapsed=10096.32s tid=0x00007efed95b29d0 nid=0x2a runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Compiling: 41155 4 my.class::myFunction (384 bytes)
and the one after finishing
"C2 CompilerThread0" #7 daemon prio=9 os_prio=0 cpu=255511.25ms elapsed=10105.79s tid=0x00007efed95b29d0 nid=0x2a waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
No compile task
So before the compilation starts, the CPU time between thread dumps (taken roughly every 10 seconds) increases only in the range of 0.xx ms, which is negligible. During the compilation, however, it rose by about 70 seconds, which seems insane. The ID 41155, which is apparently assigned per compilation task and should therefore not repeat, is also the same across the relevant thread dumps. What I have yet to figure out is what the (384 bytes) at the end means - it stays the same across the dumps.
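For reference, the raw numbers: between the dump before the compile task appeared and the one after it finished, the CPU time went from 185150.71 ms to 255511.25 ms, i.e. roughly 70.4 s of CPU, while the elapsed time only went from 10028.85 s to 10105.79 s, i.e. roughly 77 s of wall clock - so the compiler thread was busy for almost the entire window. The Compiling: lines also appear to use the same compile-ID / tier / method / byte-count format as -XX:+PrintCompilation output, so the same task should presumably be traceable continuously with that flag.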
This function is per se not special - it is called plenty of times across several pods and seems to be absolutely fine in 99% of the cases. These peaks are super rare occasions, but we can't seem to find a root cause.
What might be relevant information here is that the VM.native_memory output shows the memory wandering from Compilation to Arena Chunks, where it is then quickly cleaned up, and memory usage drops back to normal levels.
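That movement should be easier to pin down by taking an NMT baseline before a peak and diffing afterwards, along the lines of (PID again a placeholder):

jcmd <pid> VM.native_memory baseline
jcmd <pid> VM.native_memory summary.diff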
I know there was a memory leak in Java 17.0.4, but we are running
OpenJDK 64-Bit Server VM Temurin-17.0.7+7 (build 17.0.7+7, mixed mode, sharing)
so this shouldn't be the issue at hand.
The question here is: where would one go from here? I've added several log statements around this function in the application, but can't seem to find any "misbehaviour".
Another question would of course be: how does the code behave while it is being compiled? As in, would we even get any logs until the compilation is finished? The idea being that it would compile -> execute, and the logs would therefore only "appear" afterwards.
After adding the flag -XX:CompileCommand=exclude,myClass::myFunction, the output of the thread dump is now a bit different:
"C2 CompilerThread0" #7 daemon prio=9 os_prio=0 cpu=244772.46ms elapsed=15463.68s tid=0x00007fe65e0299d0 nid=0x2c runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Compiling: 42704 ! 4 myClass::deliverMessage (318 bytes)
This helps me narrow it down a bit, but the question is: why is there a ! in the output?
One level deeper we go! The peak happened again, and now with deliverMessage excluded from JIT compilation, we reach the next level:
"C2 CompilerThread0" #7 daemon prio=9 os_prio=0 cpu=209412.09ms elapsed=5769.99s tid=0x00007f855b8b69d0 nid=0x2c runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Compiling: 41589 4 my.kafka.Payload::toString (288 bytes)
So this seems to be the "root" that is triggering the JIT to eat up memory and take a long time compiling. This toString is rather big, with roughly 70 variables printed out, and it is all built with the following pattern:
return "KafkaPayload{" +
"id=" + id +
", externalId='" + externalId + '\'' +
...
", eSignatureBase64Image='" + (shouldContainESignatureBase64Image ? eSignatureBase64Image : "censored") + '\'' +
...
}
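To give a better idea of the shape of the method, here is a trimmed-down, self-contained version of that pattern (field names are partly made up; the real class has roughly 70 of them):

public class KafkaPayload {
    private long id;
    private String externalId;
    private String status; // made-up stand-in for the many other real fields
    private boolean shouldContainESignatureBase64Image;
    private String eSignatureBase64Image; // base64-encoded image, can get relatively large

    @Override
    public String toString() {
        // same concatenation style as above, just with far more terms in the real class
        return "KafkaPayload{" +
                "id=" + id +
                ", externalId='" + externalId + '\'' +
                ", status='" + status + '\'' +
                ", eSignatureBase64Image='" + (shouldContainESignatureBase64Image ? eSignatureBase64Image : "censored") + '\'' +
                '}';
    }
}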
Interesting here is that the big boy, namely eSignatureBase64Image, is censored out, as it can get relatively large. Could it be that even though it's not printed out, the JIT still has an internal conflict and somehow tries to optimize it?