24

Our application requires a very large amount of memory since it deals with very large data sets, so we increased our max heap size to 12GB (-Xmx).

Following are the environment details:

OS - Linux 2.6.18-164.11.1.el5    
JBoss - 5.0.0.GA
VM Version - 16.0-b13 Sun JVM
JDK - 1.6.0_18

We have the above environment and configuration in both QA and prod. In QA the max PS Old Gen (heap memory) allocated is 8.67GB, whereas in prod it is just 8GB.

In prod, for a particular job, the Old Gen heap reaches 8GB, hangs there, and the web URL becomes inaccessible; the server effectively goes down. In QA it also reaches 8.67GB, but a full GC is performed and usage comes back down to around 6.5GB, and it does not hang.

We couldn't figure out a solution for this because the environment and configuration on both boxes are the same.

I have 3 questions here:

  1. Supposedly 2/3 of the max heap is allocated to the old/tenured generation. If that is the case, why is it 8GB in one place and 8.67GB in the other?
  2. How do I provide a valid ratio for the new and tenured generations in this case (12GB)?
  3. Why is a full GC performed in one place and not in the other?

Any help would be really appreciated. Thanks.

Please let me know if you need further details on the environment or configuration.

raksja
  • full cmd line for the jvm would be good, minimally the -XX switches that configure the garbage collector. It's difficult to comment on gc issues without knowing how the vm is configured. btw you put -Xpx in the Q, you mean -Xmx really? – Matt May 09 '11 at 16:57
  • what is a job? how long does it last? how much work does it do? aka how much garbage does it generate? gc configuration is largely driven by application behaviour and pause time goals. – Matt May 09 '11 at 17:00
  • @Matt - It was a typo :( I edited it to -Xmx. Following is the full cmd line for the JVM in both environments -- -Xms1003m -Xmx13312m -XX:MaxPermSize=256m -Dorg.jboss.resolver.warning=true -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Dsun.lang.ClassLoader.allowArraySyntax=true – raksja May 10 '11 at 04:35
  • curious, I thought you needed to pass -d64 to support a heap of that size. – Matt May 10 '11 at 09:23
  • @Matt - I'm not sure what -d64 is. If you're asking about the architecture, it's 64-bit on both ends. It's a backend job; each job takes about 3 hours to complete and roughly 6 jobs run one after another. When the last job is running it reaches the 8GB max and hangs in prod. – raksja May 10 '11 at 11:51
  • -d64 is the switch to start a 64bit jvm – Matt May 10 '11 at 14:09
  • The -d64 option is not necessary on Linux. The installed Java is either 32-bit or 64-bit and whichever is first on the path is used. For more details see http://www.oracle.com/technetwork/java/hotspotfaq-138619.html#64bit_selection – WhiteFang34 May 16 '11 at 16:06

2 Answers

23

For your specific questions:

  1. The default ratio between new and old generations can depend on the system and what the JVM determines will be best.
  2. You can specify an explicit ratio between the new and old generations with -XX:NewRatio=3 (see the sizing sketch below this list).
  3. If your JVM is hanging and the heap is full, it's probably stuck doing constant GCs.
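
As a rough sizing sketch (the exact sizes the JVM reports will differ a little, and the jar name below is just a placeholder), -XX:NewRatio=3 gives the old generation three parts of the heap for every one part given to the new generation:

    # old gen : new gen = 3 : 1, so new gen ~ 12GB / (3 + 1) = 3GB and old gen ~ 9GB
    java -Xmx12g -XX:NewRatio=3 -jar your-app.jar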

It sounds like you need more memory for prod. If on QA the request finishes then perhaps that extra 0.67GB is all that it needs. That doesn't seem to leave you much headroom though. Are you running the same test on QA as will happen on prod?

Since you're using 12GB you must be using 64-bit. You can save the memory overhead of 64-bit addressing by using the -XX:+UseCompressedOops option. It typically saves 40% memory, so your 12GB will go a lot further.
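
If you want to confirm whether compressed oops are actually in effect, one option (assuming your JVM is recent enough to support -XX:+PrintFlagsFinal, which I believe appeared around 6u21) is to print the final flag values and look for UseCompressedOops:

    # placeholder invocation; add your normal application flags as needed
    java -Xmx12g -XX:+UseCompressedOops -XX:+PrintFlagsFinal -version | grep UseCompressedOops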

Depending on what you're doing the concurrent collector might be better as well, particularly to reduce long GC pause times. I'd recommend trying these options as I've found them to work well:

-Xmx12g -XX:NewRatio=4 -XX:SurvivorRatio=8 -XX:+UseCompressedOops
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC
-XX:+UseCMSInitiatingOccupancyOnly -XX:+CMSClassUnloadingEnabled
-XX:+CMSScavengeBeforeRemark -XX:CMSInitiatingOccupancyFraction=68
WhiteFang34
  • Thanks for your reply. I presume that CompressedOops will reduce performance when the heap is large (in our case 12GB). Can anyone please share your thoughts if you have used the compressed option with a huge heap? – raksja May 12 '11 at 11:46
  • We've done numerous large scale benchmarks with a 24GB heap. The `-XX:+UseCompressedOops` option does not measurably reduce performance. If you're in a limited memory situation then it can definitely improve performance. Especially if it avoids getting dangerously low on memory or running out, as it sounds like your case is. For 12GB it'll effectively be like giving it 16GB without `-XX:+UseCompressedOops`. – WhiteFang34 May 12 '11 at 17:39
  • @Fang - Thanks for your reply. With your suggestion I have enabled the `-XX:+UseCompressedOops` option alone. I will post the result soon. By the way, we are planning to try all the options you have suggested above. If you could elaborate on the reason for each of the options above, it would be very helpful to us. Thanks in advance. – raksja May 13 '11 at 10:31
  • `-XX:+UseCompressedOops` did the trick. I added only this extra parameter and it is working now with the 12GB heap itself. The old gen stays within 8GB while doing its processing, and frequent GC/compaction brings it back down to 6.5GB. If you still recommend specifying the ratio, please explain why. Thanks. – raksja May 16 '11 at 11:46
  • @techastute: that's good to hear, glad it helped. The ratio specification is not necessarily required. It's just a convenient way to specify a reasonable default that would be consistent for the same heap size and scales well for different large heap sizes. All of the other options I recommended are focused on eliminating long GC pauses. For our scenario we found the concurrent collector often fell back to full GC and other long pauses without tuning. Those options are the result of many benchmark and production experiments over long periods of time with sustained heavy web traffic. YMMV. – WhiteFang34 May 16 '11 at 15:59
3

You need to get some more data in order to know what is going on; only then will you know what needs to be fixed. To my mind that means:

  1. Get detailed information about what the garbage collector is doing; these params are a good start (substitute your preferred path and file in place of gc.log):

    -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:gc.log -verbose:gc

  2. Repeat the run, scan through the gc log for the period when it is hanging, and post back with that output (see the grep sketch after this list).

  3. Consider watching the output using visualgc (requires jstatd running on the server; one random link that explains how to do this setup is this one), which is part of jvmstat. This is a very easy way to see how the various generations in the heap are sized (though perhaps not for 6 hours!).
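
For step 2, a quick way to pull out the interesting lines (assuming the logging flags above; the exact log wording can vary a little between JVM versions) is something like:

    grep "Full GC" gc.log
    grep "Total time for which application threads were stopped" gc.log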

I also strongly recommend you do some reading so you know what all these switches are referring to, otherwise you'll be blindly trying stuff with no real understanding of why one thing helps and another doesn't. I'd start with the Oracle Java 6 GC tuning page, which you can find here.

I'd only suggest changing options once you have baselined performance. Having said that, CompressedOops is very likely to be an easy win; you may want to note it has been on by default since 6u23.

Finally, you should consider upgrading the JVM; 6u18 is getting on a bit and performance keeps improving.

each job will take 3 hours to complete and almost 6 jobs running one after another. Last job when running reaches 8GB max and getting hang in prod

Are these jobs related at all? This really sounds like a gradual memory leak if they're not working on the same dataset. If heap usage keeps going up and up and eventually blows up, then you have a memory leak. You should consider using -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/some/dir to catch a heap dump (though note that with a 13G heap it will be a big file, so make sure you have the disk space) if/when it blows. You can then use jhat to look at what was on the heap at the time.
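
As a minimal sketch of that workflow (the jar name, dump directory, and pid below are placeholders; jhat itself needs a large heap to load a dump of this size, hence the -J options):

    java -Xmx12g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/some/dir -jar your-app.jar
    jhat -J-d64 -J-Xmx16G /path/to/some/dir/java_pid12345.hprof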

Matt
  • I suppose we won't be able to analyze a dump of more than 512MB using jhat. Is there any other way to analyze the dump? Or please suggest a solution. Thanks in advance. – raksja May 12 '11 at 12:41
  • you can analyse bigger dumps than that, might depend on how much RAM you have mind you. – Matt May 12 '11 at 14:29
  • I got a dump of **11GB** in my QA box where the out of memory is thrown. As I expected, I couldn't analyze the dump with **JHAT** and it throws an OutOfMemory. :( The machine where I ran JHAT has 16 processors with 96GB of RAM. Any way to split and analyze that? I used `jhat java_pid1491.hprof#1` – raksja May 13 '11 at 10:20
  • man jhat for info on how to pass args to the jvm, e.g. `jhat -J-d64 -J-Xmx12G ` should run jhat on a 64-bit JVM with a 12G heap. It sounds like you have plenty of RAM to deal with it. – Matt May 13 '11 at 10:51
  • :) I even tried this `jhat -J-d64 -Xmx12g -XX:-UseBiasedLocking java_pid1491.hprof#1` - still it throws OOM. Any suggestions would be of great help. Thanks – raksja May 13 '11 at 15:36
  • you need -J before each option – Matt May 13 '11 at 15:39
  • It's not working, Matt. I tried every jhat option. Would you suggest any other heap analyzer to use on the Linux platform? – raksja May 14 '11 at 15:02
  • I have created a separate [STACK-QUESTION](http://stackoverflow.com/questions/6026959/jhat-throws-oom-when-trying-to-analyse-a-huge-heap-dump-11gb) for this OOM in JHAT. Please post your answers/comments in that. Thanks. – raksja May 17 '11 at 06:34