On a moderately busy production server (50 app threads, 30% CPU utilisation), we're seeing a scenario where the CMS collector doesn't keep pace with objects promoted into the old generation.
My initial thought was that these objects must still be referenced, and so not eligible for collection, but when the old gen fills and triggers a serial collection, 5.5 GiB of the 6 GiB is recovered.
The Eden space is sized at 3 GiB and takes around 20-30 seconds to fill enough to trigger a young collection. Survivor space usage fluctuates between 800 and 1250 MiB, with a maximum of 1.5 GiB for each survivor space.
With the objects in the old gen eligible for collection, and the server having plenty of (apparent) resources, I don't understand why the CMS collector isn't keeping on top of the old gen size.
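For reference, the occupancy and collector activity described above can be sampled from inside the JVM with the standard java.lang.management beans, which are available on Java 6. This is only a minimal sketch: the class name GcSnapshot and its helper methods are made up for illustration, and the pool and collector names it prints depend on the JVM and the collector selected (with CMS they are typically reported as "CMS Old Gen", "ParNew" and "ConcurrentMarkSweep", but that is an assumption worth checking against the actual output).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper class, written for illustration only.
public class GcSnapshot {

    // Percentage of a pool's maximum that is in use, or -1 if the maximum is undefined.
    static double percentUsed(long used, long max) {
        if (max <= 0) {
            return -1.0;
        }
        return 100.0 * used / max;
    }

    // Builds a one-off report of heap pool occupancy and per-collector totals.
    static String report() {
        StringBuilder sb = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            sb.append(String.format("pool %-25s used=%d max=%d (%.1f%%)%n",
                    pool.getName(), usage.getUsed(), usage.getMax(),
                    percentUsed(usage.getUsed(), usage.getMax())));
        }
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Count and time are cumulative since JVM start; -1 means undefined.
            sb.append(String.format("collector %-25s count=%d timeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

Sampling this at intervals and comparing it with the GC log should show whether the old gen grows steadily between collections and whether the old-generation collector's count advances at all before the pool fills.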
What could cause this scenario and are there any solutions?
I'm aware of the occupancy fraction, but I don't understand the implications of CMSIncrementalSafetyFactor. I've read some Oracle documentation, but I don't know what "add[ing] conservatism when computing the duty cycle" actually means.
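Since the safety factor is documented alongside the incremental-mode (i-CMS) options, it may help to confirm which CMS-related options the JVM was actually started with. The sketch below lists them using RuntimeMXBean.getInputArguments(); the class name CmsFlags and the filter rule (any -XX: option mentioning "CMS" or "ConcMarkSweep") are assumptions made for illustration, and options left at their defaults will not appear because only explicitly passed arguments are reported.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, written for illustration only.
public class CmsFlags {

    // Returns the -XX: arguments whose text mentions CMS or ConcMarkSweep.
    static List<String> cmsRelated(List<String> jvmArgs) {
        List<String> matches = new ArrayList<String>();
        for (String arg : jvmArgs) {
            boolean isVmOption = arg.startsWith("-XX:");
            boolean mentionsCms = arg.contains("CMS") || arg.contains("ConcMarkSweep");
            if (isVmOption && mentionsCms) {
                matches.add(arg);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to the application's main method.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        for (String flag : cmsRelated(jvmArgs)) {
            System.out.println(flag);
        }
    }
}
```

If -XX:+CMSIncrementalMode is not in the list, the incremental duty-cycle settings are presumably not the ones governing when and how hard the concurrent collector runs.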
Alternatives
Switching to a parallel / throughput collector yields a very low GC overhead (1.8%) but leaves occasional (50 times per day) long pauses - around 20 seconds for each full GC. Even with some tuning, this isn't likely to meet our max pause target.
In an ideal world, we'd be able to experiment with the G1 collector, but for various reasons we are stuck with a Java 6 JVM.