I have an instance of zookeeper that has been running for some time... (Java 1.7.0_131, ZK 3.5.1-1), with -Xmx10G -XX:+UseParallelGC.
Recently there was a leadership change, and the memory usage on most instances in the quorum went from ~200MB to 2GB+. I took a jmap dump, and the interesting thing I found was a lot of byte[] serialization data (>1GB) that had no GC Root but hadn't been collected.
(This is ByteArrayOutputStream, DataOutputStream, org.apache.jute.BinaryOutputArchive, or HeapByteBuffer, BinaryOutputArchive).
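To see which part of the heap is holding that data, a minimal sketch along these lines could list the used bytes per heap memory pool through the standard java.lang.management API. The class name HeapPools is my own invention, and as written it inspects the JVM it runs in; pointing it at the ZooKeeper process would need a JMX connection that is not shown here. With -XX:+UseParallelGC the pools are typically named PS Eden Space, PS Survivor Space and PS Old Gen.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

final class HeapPools {
    /** Maps each valid heap memory pool name to its currently used bytes. */
    static Map<String, Long> usedByPool() {
        Map<String, Long> used = new LinkedHashMap<String, Long>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP || !pool.isValid()) {
                continue; // skip non-heap pools (code cache, metaspace, ...)
            }
            MemoryUsage usage = pool.getUsage();
            if (usage != null) {
                used.put(pool.getName(), usage.getUsed());
            }
        }
        return used;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Long> e : usedByPool().entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue() / (1024 * 1024) + " MB used");
        }
    }
}
```

If most of the 2GB+ shows up in the old generation pool, that would at least narrow down where the unreachable byte[] data is sitting.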
Looking at the GC log, shortly before the election change the full GC was running every 4-5 minutes. After the election, the tenuring threshold increased from 1 to 15 (max) and the full GC ran less and less often; eventually it didn't even run on some days.
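As a cross-check on the GC log, the cumulative collection counts can also be read from the GarbageCollectorMXBeans. This is only a sketch: the class name GcCounts is hypothetical, and it reports on the JVM it runs in rather than on a remote ZooKeeper process. Under -XX:+UseParallelGC the young collector is normally reported as PS Scavenge and the full collector as PS MarkSweep.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

final class GcCounts {
    /** Maps each collector name to its cumulative collection count (-1 if undefined). */
    static Map<String, Long> snapshot() {
        Map<String, Long> counts = new LinkedHashMap<String, Long>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            counts.put(gc.getName(), gc.getCollectionCount());
        }
        return counts;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Long> e : snapshot().entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue() + " collections");
        }
    }
}
```

Taking two snapshots some minutes apart and comparing the full-collector count would show whether full GCs have really stopped, independently of how the log is parsed.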
After several days, suddenly and mysteriously to me, something changes, and the memory plummets back to ~200MB with full GC running every 4-5 minutes again.
What I'm confused about here is how so much memory can have no GC Root and still not get collected by a full GC. I even tried triggering a GC.run from jcmd a few times.
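For reference, the in-process equivalent of that experiment would look roughly like the sketch below: read the used heap, request a collection with System.gc(), and read it again. The class name GcRunProbe is made up for illustration, and the numbers describe the JVM running the sketch, not ZooKeeper itself.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

final class GcRunProbe {
    /** Bytes currently used on the Java heap, as reported by the MemoryMXBean. */
    static long usedHeapBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        return memory.getHeapMemoryUsage().getUsed();
    }

    /** Requests a GC and returns {usedBefore, usedAfter} in bytes. */
    static long[] usedAroundGc() {
        long before = usedHeapBytes();
        System.gc(); // a request only; ignored when -XX:+DisableExplicitGC is set
        long after = usedHeapBytes();
        return new long[] { before, after };
    }

    public static void main(String[] args) {
        long[] used = usedAroundGc();
        System.out.println("used before GC: " + used[0] / (1024 * 1024) + " MB");
        System.out.println("used after GC:  " + used[1] / (1024 * 1024) + " MB");
    }
}
```

If the before and after figures barely differ, it would be worth confirming that explicit GC requests are actually being honoured by the JVM in question.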
I wondered if something in ZK native land was holding onto this memory, or leaking this memory... which could explain it.
I'm looking for any debugging suggestions; I'm planning on upgrading to Java 1.8, and maybe ZK 3.5.4, but would really like to root cause this before moving on.
So far I've used VisualVM, GCViewer and Eclipse MAT.
(Solid vertical black lines are full GC. Yellow is young generation).