CMS collector not keeping pace with Old Gen

Question

On a moderately busy production server (50 app threads, 30% CPU utilisation), we're seeing a scenario where the CMS collector doesn't keep pace with objects promoted into the old generation.

My initial thoughts were that these objects were obviously still referenced, so not eligible for collection - but when the Old Gen fills and prompts a serial collection, 5.5 GiB of the 6 GiB is recovered.

The Eden space is sized at 3 GiB, and takes around 20-30 seconds to fill enough to prompt a young collection. Survivor space usage fluctuates between 800 - 1250 MiB, with a 1.5 GiB maximum (each).

With the objects in the old gen eligible for collection, and the server having plenty of (apparent) resources, I don't understand why the CMS collector isn't keeping on top of the old gen size:

What could cause this scenario and are there any solutions?

I'm aware of the occupancy fraction, but I don't understand the implications of the CMSIncrementalSafetyFactor - I've read some Oracle documentation, but I don't know what "add[ing] conservatism when computing the duty cycle" actually means..?

Alternatives

Switching to a parallel / throughput collector yields a very low GC overhead (1.8%) but leaves occasional (50 times per day) long pauses - around 20 seconds for each full GC. Even with some tuning, this isn't likely to meet our max pause target.

In an ideal world, we'd be able to experiment with the G1 collector, but for various reasons we are stuck with a Java 6 JVM.

Reviewing the output from the GC log events could give you more insight. — Marko Topolnik, Jul 30 '15 at 09:14
@MarkoTopolnik Is there any specific events I should look for? I've reviewed the logs but nothing jumps out as out of the ordinary (to my eyes). — Michael, Jul 30 '15 at 09:20
I would expect that the timing of all the CMS phase transition events and the actual amounts collected should give a clearer picture than the question presents now. — Marko Topolnik, Jul 30 '15 at 09:22

score 1 · Accepted Answer · answered Aug 18 '15 at 03:28

When you say that the CMS collector doesn't keep pace with your object promotion rate, it means that you should should see "concurrent mode failures" in the GC logs. Those are what you get when the CMS collector "loses the race" and you run out of memory before it completes.

2014-02-27T01:09:52.408-0600: 847.004: [GC 847.005: [ParNew 
(promotion failed)
Desired survivor size 78512128 bytes, new threshold 2 (max 15)
- age   1:   60284680 bytes,   60284680 total
- age   2:   32342648 bytes,   92627328 total
: 1380096K->1380096K(1380096K), 0.7375510 secs]847.743: 
[CMS2014-02-27T01:09:54.133-0600: 848.729: [CMS-concurrent-s
weep: 5.467/6.765 secs] [Times: user=21.59 sys=0.73, real=6.76 
secs]
  (concurrent mode failure): 2363866K->1763900K(4409856K),
10.6658960 secs] 3697627K->1763900K(5789952K), [CMS Perm : 
118666K->117980K(125596K)], 11.4061610 secs] 
[Times: user=11.34 sys=0.02, real=11.57 secs]

By default, the CMS collector will trigger at 92% occupancy in the old generation. Judging by the memory growth rate in your graph of the old generation usage, you are growing at about 500 MB every 5 minutes. 92% of 6GB gives you about 500 MB of headroom, which means that CMS has to win the race in less than 5 minutes, which it will. Unless...

...you have something happening behind the scenes other than the smooth traffic profile that we see in the graph. For example, do you have any background processes that refresh in-memory data structures like caches? Those types of activities create a sudden, huge amount of new, long-lived objects that need to be promoted to the old gen. It would make your smooth graph go vertical all of a sudden, and it can quickly exhaust the available memory. The CMS collector is good at handling smooth, steady traffic, but it is very vulnerable to quick bursts of activity. It is good at responding to gradual changes in your garbage generation rate, but it cannot anticipate "bursty" behavior, and I have seen many cases like this that cause it to lose the race.

Apart from completely avoiding background processes that produce sudden bursts of new objects, you can give the CMS collector a head start by decreasing the CMSInitiatingOccupancyFraction parameter down to something between 60-80 rather than the default of 92%.

http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#cms.starting_a_cycle

Also, keep an eye on your PermGen space too. Unlike the Parallel Throughput Collector, the CMS collector does not collect the PermGen by default, so if it ever fills up, you end up with a stop-the-world full GC. This parameter makes the CMS collector collect the PermGen space as well: CMSClassUnloadingEnabled.

Other than that, I recommend turning on GC logging and setting: -XX:+PrintGCDetails prints details about every minor and major garbage collection

This is a great parameter that lets you see every JVM setting at startup: -XX:+PrintFlagsFinal prints the value of all JVM configuration options at startup

CMS collector not keeping pace with Old Gen

1 Answers1