
Environment details: OS: Red Hat Linux; Java: JRE 6 Update 21

I am using the following GC settings for my app:

-server -d64 -Xms8192m -Xmx8192m -javaagent:lib/instrum.jar -XX:MaxPermSize=256m -XX:+UseParNewGC -XX:+ParallelRefProcEnabled -XX:+UseConcMarkSweepGC -XX:MaxGCPauseMillis=250 -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing -XX:+CMSParallelRemarkEnabled -verbose:gc -Xloggc:/tmp/my-gc.log -XX:+DisableExplicitGC -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseCompressedOops

With these settings, there is a single Full GC at the beginning of the application:

2.946: [Full GC 2.946: [CMS: 0K->7394K(8111744K), 0.1364080 secs] 38550K->7394K(8360960K), [CMS Perm : 21247K->21216K(21248K)], 0.1365530 secs] [Times: user=0.10 sys=0.04, real=0.14 secs] 

This is followed by 4-5 successful CMS collections, but after that there is no trace of CMS in the logs; there are only entries for minor collections:

379022.293: [GC 379022.293: [ParNew: 228000K->4959K(249216K), 0.0152000 secs] 7067945K->6845720K(8360960K) icms_dc=0 , 0.0153940 secs]

The heap is growing continuously and has reached 7 GB. We have to restart the application, as we cannot afford an OOM or any other kind of breakdown in the production system.

I am not able to understand why the CMS collector has stopped cleaning. Any clues/suggestions are welcome. Thanks in advance.

======================================================================================
Updated 23rd Jan.

Thanks everyone for the responses so far. I have set up the application in a test environment and tested it with the following sets of JVM options:

Option #1

-server -d64 -Xms8192m -Xmx8192m -javaagent:instrum.jar -XX:MaxPermSize=256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -verbose:gc -Xloggc:my-gc.log -XX:+PrintGCTimeStamps -XX:+PrintGCDetails

Option #2

-server -d64 -Xms8192m -Xmx8192m -javaagent:instrum.jar -XX:MaxPermSize=256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -verbose:gc -Xloggc:my-gc.log -XX:+DisableExplicitGC -XX:+PrintGCTimeStamps -XX:+PrintGCDetails

I ran the test with both settings for 2 days in parallel. These are my observations:

Option #1: The heap memory is stable, but there were 90 ConcurrentMarkSweep collections and the JVM spent 24 minutes in them. That's too high. I also see the following lines in the GC logs, and the pattern repeats every hour...

318995.941: [GC 318995.941: [ParNew: 230230K->8627K(249216K), 0.0107540 secs] 5687617K->5466913K(8360960K), 0.0109030 secs] [Times: user=0.11 sys=0.00, real=0.01 secs]
319050.363: [GC 319050.363: [ParNew: 230195K->9076K(249216K), 0.0118420 secs] 5688481K->5468316K(8360960K), 0.0120470 secs] [Times: user=0.12 sys=0.01, real=0.01 secs]
319134.118: [GC 319134.118: [ParNew: 230644K->8503K(249216K), 0.0105910 secs] 5689884K->5468704K(8360960K), 0.0107430 secs] [Times: user=0.11 sys=0.00, real=0.01 secs]
319159.250: [Full GC (System) 319159.250: [CMS: 5460200K->5412132K(8111744K), 19.1981050 secs] 5497326K->5412132K(8360960K), [CMS Perm : 72243K->72239K(120136K)], 19.1983210 secs] [Times: user=19.14 sys=0.06, real=19.19 secs]

I don't see the concurrent mark and sweep phases in the logs. Does this mean CMS switched to the throughput collector? If so, why?

Option #2:

Since I see the Full GC (System) entries, I thought of adding -XX:+DisableExplicitGC. But with that option no collection is happening, and the current heap size is 7.5 GB. What I am wondering is why CMS is doing a Full GC instead of a concurrent collection.
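
For reference, a minimal standalone sketch (not the production app) of where the Full GC (System) entries come from: the (System) tag marks a collection triggered by an explicit System.gc() call, which is exactly what -XX:+DisableExplicitGC suppresses.

// GcTag.java - illustrative only; run with: java -verbose:gc -XX:+PrintGCDetails GcTag
public class GcTag {
    public static void main(String[] args) {
        byte[][] junk = new byte[64][];
        for (int i = 0; i < junk.length; i++) {
            junk[i] = new byte[1024 * 1024]; // allocate some soon-to-be garbage
        }
        junk = null;     // make it collectable
        System.gc();     // logged as "[Full GC (System) ..." on JDK 6 HotSpot
        // with -XX:+DisableExplicitGC the call above is a no-op and no full collection is logged
    }
}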

rontu
  • are you running on a machine with a single core? if not, get rid of the incremental mode switches. This mode is pretty much completely obsolete these days. btw, it seems strange to say you can't afford an OoM in production but you can afford to restart the entire app. Do you see the same behaviour in your test environment? – Matt Jan 10 '12 at 13:13
  • The application is running on a multicore machine. Yes, we can restart in non-working hours, but if production breaks down in working hours it's a huge problem... I have not seen this in test, possibly because we don't have enough load. I am trying to reproduce it in test now. – rontu Jan 10 '12 at 16:05
  • it sounds like your test strategy is inadequate tbh as you are running with non standard GC params that have never been subject to real stress – Matt Jan 10 '12 at 16:14
  • the GC log you show is just a single CMS collection (which is made up of a number of different phases). See http://blogs.oracle.com/poonam/entry/understanding_cms_gc_logs for details on decoding what it means. – Matt Jan 10 '12 at 16:20
  • Updated the question with test results – rontu Jan 23 '12 at 16:49
  • in case 1, is it almost exactly on the hour every hour? if so, that will be DGC (distributed GC) caused by RMI. You can demonstrate this by varying the gcInterval system props (see the sketch after this comment thread) - http://docs.oracle.com/javase/6/docs/technotes/guides/rmi/sunrmiproperties.html – Matt Jan 23 '12 at 22:08
  • generally speaking you're not providing sufficient log info for anyone to really comment on what is happening, a couple of lines is not enough. For example "heap keep on growing", details needed – Matt Jan 23 '12 at 22:09
  • Sorry Matt, Can I attach the gc log file here? When I say "heap keeps on growing", I mean the heap size is approaching 7.5G from 5.5G and there is no collection at all. But after my last edit I saw a concurrent collection which brought down the memory to 6.5G. – rontu Jan 24 '12 at 04:15
  • I tried one more parameter, CMSInitiatingOccupancyFraction=60, along with -XX:+UseParNewGC -XX:+UseConcMarkSweepGC. With this option I see concurrent mark-sweep lines in the GC log at lower heap levels, but the CMS-concurrent-abortable-preclean-start step aborts continuously. The ParNew collection happens every 50-60 seconds. Does that give you a fair idea of my environment? – rontu Jan 24 '12 at 04:23
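
Following up on Matt's DGC comment: a minimal launch-line sketch of those RMI properties (the interval value and jar name below are illustrative, not a recommendation). On JDK 6 both intervals default to 3600000 ms, i.e. one hour, which is what would produce an hourly Full GC (System) pattern from RMI's distributed GC.

java -Dsun.rmi.dgc.client.gcInterval=86400000 \
     -Dsun.rmi.dgc.server.gcInterval=86400000 \
     <existing JVM options> -jar my-app.jar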

4 Answers

1

This is a theory ...

I suspect that those CMS collections were not entirely successful. The event at 12477.056 looks like the CMS might have decided that it is not going to be able to work properly due to the 'pre-clean' step taking too long.

If that caused the CMS to decide to switch off, then I expect it would revert to using the classic "throughput" GC algorithm. And there's a good chance it would wait until the heap is full and then run a full GC. In short, if you'd just let it continue, it would have been OK (modulo that you'd get big GC pauses every now and then).


I suggest you run your application on a test server with the same heap size and other GC parameters, and see what happens when the server hits the limit. Does it actually throw an OOME?
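
If you do run such a test, something like the following standard HotSpot flags (the dump path is illustrative) would capture a heap dump at the moment of an OOME, so you can see what is actually filling the heap:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/my-app-oom.hprof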

Stephen C
1

CMS is running for you :P

You are using incremental mode on CMS (although really you should not bother, as it's likely punishing your throughput).

The icms_dc in your posted log line is a giveaway; the only thing in the JVM that logs this is ... the CMS collector. It's saying that for that GC run, a small amount of tenured-space cleanup was done interleaved with the application.

This part of your log relates to parallel new (the giveaway there is the heap size):

379022.293: [GC 379022.293: [ParNew: 228000K->4959K(249216K), 0.0152000 secs]

and this part is incremental CMS (iCMS):

7067945K->6845720K(8360960K) icms_dc=0 , 0.0153940 secs]

I would ask: why are you using iCMS? Do you have a lot of soft/weak/phantom references (or why are you using the ParallelRefProcEnabled flag)? And have you actually seen an out-of-memory error, or an insufferable pause?

Try backing down to CompressedOops, ParNewGC and CMS without anything else fancy and see if that works out for you.
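
One possible reading of that suggestion, keeping the original heap sizing and GC logging and dropping the iCMS, pause-target and reference-processing flags (a sketch only, not tested against this app):

-server -d64 -Xms8192m -Xmx8192m -javaagent:lib/instrum.jar -XX:MaxPermSize=256m -XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -verbose:gc -Xloggc:/tmp/my-gc.log -XX:+PrintGCTimeStamps -XX:+PrintGCDetails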

Greg Bowyer
0

I can see that the initial heap size (-Xms) is 8192m and the max heap size (-Xmx) is also 8192m, which might be one of the reasons why the GC is still waiting to start sweeping. I would suggest decreasing the heap size and then checking whether the GC kicks in.

0

When you set the maximum size, it allocates that amount of virtual memory immediately.

When you set the minimum size, it has already allocated the maximum size. All the minimum size does is take minimal steps to free up memory until this maximum is reached. This could be reducing the number of full GCs, because you have told it that it may use up to 8 GB freely.
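
A small standalone sketch of how to see this from inside a JVM: Runtime.maxMemory() corresponds roughly to -Xmx and Runtime.totalMemory() to the currently committed heap, so with -Xms8192m -Xmx8192m both report about 8 GB from startup.

// HeapSizes.java - prints maximum vs. currently committed heap
public class HeapSizes {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        System.out.println("max heap (-Xmx)       : " + rt.maxMemory() / mb + " MB");
        System.out.println("committed heap        : " + rt.totalMemory() / mb + " MB");
        System.out.println("free within committed : " + rt.freeMemory() / mb + " MB");
    }
}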

You have a lot of options turned on (some of them the default). I suggest you strip back to a minimal set, as they can have odd interactions when you turn lots of them on.

I would start with (assuming you have Solaris)

-mx8g -javaagent:lib/instrum.jar -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -verbose:gc -Xloggc:/tmp/my-gc.log -XX:+PrintGCTimeStamps -XX:+PrintGCDetails

The option -server is the default on server-class machines, -XX:+UseCompressedOops is the default on recent versions of Java, and -XX:MaxGCPauseMillis=250 is just a hint.

http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
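
If in doubt about what the defaults actually are on a particular JVM, recent HotSpot builds can print the resolved flag values; for example (the grep pattern is just illustrative):

java -server -d64 -XX:+PrintFlagsFinal -version | grep -E 'UseCompressedOops|UseConcMarkSweepGC|MaxGCPauseMillis'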

Peter Lawrey