
As seen below, in the midst of everything working as expected, a stop-the-world GC operation took over 60 seconds. It could be determined that the JVM stopped the world for the entire time, because the (Terracotta) clients dropped off, complaining that it (the Terracotta server) didn't respond during that period.

Is this a young/minor GC? If so, could it be caused by starvation in the young generation (eden + survivor spaces)?

Was only 109333 KB freed?

I'll start graphing the different memory pools. Are there any other suggestions for what can be done to further diagnose problems like these?

date, startMem=24589884, endMem=24478495, reclaimed=111389, timeTaken=0.211244 (1172274.139: [GC 24589884K->24478495K(29343104K), 0.2112440 secs])
date, startMem=24614815, endMem=24505482, reclaimed=109333, timeTaken=61.301987 (1172276.518: [GC 24614815K->24505482K(29343104K), 61.3019860 secs])
date, startMem=24641802, endMem=24529546, reclaimed=112256, timeTaken=2.68437 (1172348.921: [GC 24641802K->24529546K(29343104K), 2.6843700 secs])
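To support that graphing, here is a minimal sketch (my own illustration, not part of the setup above) that pulls the timestamp, heap-before, heap-after, and pause duration out of lines in the simple -XX:+PrintGC format shown above and emits CSV:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: parse -XX:+PrintGC style lines such as
    //   1172276.518: [GC 24614815K->24505482K(29343104K), 61.3019860 secs]
    // and print CSV suitable for graphing heap usage and pause times.
    public class GcLogToCsv {
        private static final Pattern GC_LINE = Pattern.compile(
                "(\\d+\\.\\d+): \\[GC (\\d+)K->(\\d+)K\\((\\d+)K\\), (\\d+\\.\\d+) secs\\]");

        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            System.out.println("timestamp,beforeKB,afterKB,heapKB,reclaimedKB,pauseSecs");
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = GC_LINE.matcher(line);
                if (m.find()) {
                    long before = Long.parseLong(m.group(2));
                    long after = Long.parseLong(m.group(3));
                    System.out.println(m.group(1) + "," + before + "," + after + ","
                            + m.group(4) + "," + (before - after) + "," + m.group(5));
                }
            }
            in.close();
        }
    }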

The Sun JVM is 1.6, using the following configuration:

-Xms28672m -Xmx28672m -XX:+UseConcMarkSweepGC -XX:+PrintGCTimeStamps -XX:+PrintGC

Sane config adjustments to further debug GC:

'-XX:+PrintGCDateStamps' Prints date stamps instead of relative timestamps
'-XX:+PrintGCDetails' Shows where CPU time went (user, sys) and which GC algorithm was used
'-XX:+PrintHeapAtGC' Shows all of the heap's memory pools and their usage
'-Xloggc:/path/to/dedicated.log' Logs GC output to a dedicated file
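Put together, a full set of options might look like this (the log path is a placeholder, as above):

    -Xms28672m -Xmx28672m -XX:+UseConcMarkSweepGC \
        -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintHeapAtGC \
        -Xloggc:/path/to/dedicated.log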
user135361
  • What is your application doing? GC jitter occurs when large amounts of memory are being reclaimed from the heap in multiple sweeps; likely each sweep is causing more objects to become eligible for collection, leading to further sweeps. Even the 2-second run you show is a huge amount of time for GC to take. I think you'll find that changing how your application handles objects, rather than the JVM configuration, is the path you need to take. Any application that needs to keep GC jitter down should look to reuse rather than reallocate objects. – codeghost Jan 23 '13 at 09:27
  • It's a session (cookie) store. The numbers, for example "reclaimed", indicate that not much memory was actually reclaimed. If so, it would be useful to know why. I agree with you fully that it needs to be solved in how the (session) objects are handled, what they contain, etc. There's a process underway for handling the sessions better, but for now my job is to find out why a GC would take over 60 seconds and still not free more memory than the previous 0.2 s GC did. – user135361 Jan 23 '13 at 09:51
  • 2
    Take a look at this blog http://kirk.blog-city.com/why_do_i_have_this_long_gc_pause.htm may give you a few pointers. – codeghost Jan 23 '13 at 10:03

1 Answer


-XX:+UseConcMarkSweepGC enables concurrent collection

[Figure: Default vs. CMS collector]

Total time taken is the sum of stop-the-world phases (JVM blocked) and concurrent phases (JVM executing user code).

You should enable detailed GC logging to investigate further, as you have no information about how much of those 60+ seconds was spent blocking the JVM.
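For example, in addition to the detailed GC flags mentioned in the question, HotSpot 1.6 can report safepoint pauses directly, which shows how long application threads were actually stopped around each collection (the flag names are standard HotSpot options; the exact wording of the logged message may vary by JVM build):

    -XX:+PrintGCApplicationStoppedTime      logs the total time for which application threads were stopped at each pause
    -XX:+PrintGCApplicationConcurrentTime   logs the time the application ran between pauses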

fglez
  • I've already outlined that more verbose output will be logged. The question is not how CMS works, but rather under what circumstances such a long stop-the-world pause would occur. The JVM was blocked the entire time, so it's a "major GC", likely because of defragmentation. If so, the question is how I can know whether adding more heap will help, or in which cases scheduling CMS to kick in earlier would help. I appreciate the attempt, but you will have to take a stab at isolating the factors before I upvote you. – user135361 Jan 31 '13 at 10:34
  • 2
    How do you know the JVM was blocked the entire time without detailed log? You should add to the question all the info you have. – fglez Jan 31 '13 at 15:52