5

After trying numerous different JVM GC settings and doing a lot of testing, during which I was having problems with long major GC pauses, I'm now testing with the G1GC collector. Besides this I'm also collecting data with Performance Monitor, and the only application running (besides system services, ...) is the GlassFish server with my application. I didn't find anything strange in the Performance Monitor log (CPU usage is around 5-10% and gets a little higher when GC occurs, memory is about 60% used, ...). It's now the fifth day of testing and I noticed the following:

[GCViewer chart: used heap, tenured (old) generation size, and GC pause durations over the five-day test]

Until the second major (mixed) GC occurred, everything was fine (minor GCs were about 20ms long, GC performance was 160000M/s, ...). The second major GC took around 2 seconds (long; the first took only 150ms, so it wasn't critical), and after that the minor GCs are much longer than before (see the grey lines in the picture, which represent the durations of the minor (young) GCs); GC performance is only 12000M/s and is still dropping. I'm wondering why this is happening after the second major GC, even though no other applications are running and CPU/memory usage is the same as before. I don't know what is happening here.

I also have one more question: I'm running the same test on a different PC which has less RAM and an older processor, and GC performance there is around 5000M/s (minor GCs are around 50-100ms), which I think is normal given the weaker processor and less RAM. What is strange is that no major GC has happened there yet after 3 days of running, and the old generation is growing much more slowly than here, even though the settings are the same. Why is it growing so much more slowly (here around 150MB in two days, on the second PC 80MB in three days)? Thank you for all your answers; I don't know why the GC is behaving so abnormally (first it works fine and then performance worsens).
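For context, a G1 setup with detailed GC logging, of the kind that produces the log shown below, would look roughly like this (a sketch only; the exact flag set used in this test isn't listed here, and heap size and file names are placeholders):

java -Xms512m -Xmx512m \
  -XX:+UseG1GC \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:gc.log \
  -jar application.jar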

EDIT: here is the complete GC log file, imported into GCViewer, along with the event details statistics from GCViewer:

[GCViewer event details statistics for the imported GC log]

Log for 3rd major GC:

2015-06-08T08:09:13.123+0200: 572815.533: [GC concurrent-root-region-scan-start]
2015-06-08T08:09:13.139+0200: 572815.560: [GC concurrent-root-region-scan-end, 0.0271771 secs]
2015-06-08T08:09:13.139+0200: 572815.560: [GC concurrent-mark-start]
2015-06-08T08:09:16.302+0200: 572818.721: [GC concurrent-mark-end, 3.1612900 secs]
2015-06-08T08:09:16.318+0200: 572818.729: [GC remark 572818.729: [Finalize Marking, 0.0002590 secs] 572818.729: [GC ref-proc, 0.4479462 secs] 572819.177: [Unloading, 3.2004912 secs], 3.6499382 secs]
 [Times: user=0.20 sys=0.08, real=3.64 secs] 

Again, real time was much higher than user + sys, and the Unloading phase took more than 3 seconds.

user4341206
  • 647
  • 1
  • 7
  • 26
  • Please describe what the colours mean and what these growing lines represent. And can we see the durations of the major GCs here? If yes, where are they? – AdamSkywalker Jun 06 '15 at 18:14
  • can you provide the raw GC logs for the time span of interest? also make sure to log with `-XX:+PrintGCDetails` – the8472 Jun 06 '15 at 18:22
  • Ok, the purple line is the tenured (old) generation, the blue line is the used heap (so basically the young generation, since it starts from the old-generation line), the grey lines below are minor GC times, and the major GC times (concurrent collections of the old generation; not full GCs) are the two yellow lines where the heap size drops (the first took 0.1s and the second almost 2 seconds). This graph is from the GCViewer application. I'll upload the raw GC logs right away. – user4341206 Jun 06 '15 at 18:27
  • I edited my post adding GC log and event details from GCViewer. – user4341206 Jun 06 '15 at 18:37
  • i would recommend gists or pastebin for log files, not file hosting services. – the8472 Jun 06 '15 at 18:41
  • I tried to share it using pastebin but I wasn't able to paste all log code (5MB size)... – user4341206 Jun 06 '15 at 19:20
  • Interestingly, the ~2s pause was a remark at `2015-06-05T09:14:51.909+0200` (line 307602), which reports `[Times: user=0.17 sys=0.00, real=1.96 secs]`. It seems to me that you have either a resource problem or your application takes too long to reach a safe point. – Stefan Ferstl Jun 07 '15 at 07:23
  • I monitored the system using Performance Monitor and didn't see any resource problems. What could cause the application to take too long to reach a safe point, and how can I find out whether that is the problem? – user4341206 Jun 07 '15 at 10:53
  • @user4341206 any progress here? – AdamSkywalker Jul 05 '15 at 20:11
  • Hello, I think I solved the problems with long GC pauses. In the last month (the server is running 24/7) the longest GC pause was 190ms. I believe adding the -XX:+PerfDisableSharedMem flag to the JVM eliminated this problem (I read about this flag here: evanjones.ca/jvm-mmap-pause.html)... Besides this flag I added a few more: -XX:+ParallelRefProcEnabled, -XX:MetaspaceSize=128m, ... If I come across this problem (long GC times) again I'll let you know, but I hope this was the real solution. – user4341206 Jul 06 '15 at 07:31

2 Answers

1

I can't give you a verified explanation without deep-diving into the problem, but let's try.

You most likely read the Oracle G1 GC documentation and know that this collector was made for applications with huge heaps. It was designed to reduce long pauses during major garbage collections.

In your example we can see that the minor GC pauses increase right after the second major GC. I don't know the internal mechanics of G1, but intuition tells me that the second major GC was an internal "signal" for G1 to analyze its own performance (is it achieving its throughput/pause goals?). So it seems to me that after this analysis G1 decided to change its internal parameters. I bet that the longer minor GC pauses indicate that G1 increased the young generation space.

Why did it do that? Well, a bigger young generation decreases the rate of promotion into the old generation, so the old generation fills more slowly. Most likely G1 tried to prevent the next major GC pause from being as long as the previous one (2 seconds).
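If you want to verify whether G1 really is resizing the young generation (this is only my guess, nothing in the chart proves it), you can make it log its ergonomic decisions and, if necessary, bound the young generation size. These are standard HotSpot/G1 flags, but the percentage values below are purely illustrative:

-XX:+PrintAdaptiveSizePolicy          (logs G1's ergonomic resizing decisions)
-XX:+UnlockExperimentalVMOptions      (needed for the two G1New* flags below)
-XX:G1NewSizePercent=5                (lower bound for the young generation, illustrative value)
-XX:G1MaxNewSizePercent=25            (upper bound for the young generation, illustrative value)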

Again, I can't guarantee that things work exactly as I described without deeper analysis, but I'm trying to give you some directions.

As for your "slower" machine that ran for 3 days without a major GC pause: it's really hard to say without knowing the GC configuration and how your application works. I bet the application load was lower.

AdamSkywalker
  • 11,408
  • 3
  • 38
  • 76
  • Thank you for your answer. What you wrote makes sense as an explanation for why the minor GCs are longer - the old generation fills more slowly, and I think this can also be seen in the graph (the purple line is increasing more slowly than before). I did not think of that. Yes, I read the documentation for all the GCs and I also tried all of them, but I got long pauses from all of them except G1GC (http://stackoverflow.com/questions/30566779/jvm-gc-problems) for now. I also used Performance Monitor but did not find anything useful. Maybe G1GC also tried to prevent a major GC on the slower machine, like here... – user4341206 Jun 06 '15 at 19:11
  • But if the GC really decided that, it means throughput and GC performance will be lower (GC performance is now about 20% lower, throughput so far only 0.02% lower). I'm really wondering how long the next major GC will last. Can you please also tell me whether you recommend not using G1GC for my server because of the small heap size? If so, can you tell me what settings to use with CMS to get smaller pause times (see my other question, linked in the previous comment)? Thanks... – user4341206 Jun 06 '15 at 19:26
  • 1
    As was mentioned by @the8472, you can give G1 a pause goal by specifying flags. I did not mention that in the answer because I thought you had already done it. Anyway, I read your second question to get more details. For a 512 MB heap you have enormously long major GCs. It is not a GC issue. I bet the problem is 1) the GlassFish server or 2) Windows tools. I would try Tomcat / switching to Linux. I had a real-world case where Windows performance counters dramatically slowed an application. – AdamSkywalker Jun 06 '15 at 20:29
  • Yes, I've tried different settings already, including MaxGCPauseMillis, but I decided to leave that setting at the default G1GC value (200ms). I haven't tried setting GCTimeRatio yet; I'll try it... I also set many other flags that I came across while reading different questions about JVM GC, but none of them helped reduce pause times with the different GCs, so now I'm asking you what else I can do. If I don't resolve this problem I'll try different servers / OSes. But I'm still wondering what is causing this and I want to find a solution. – user4341206 Jun 06 '15 at 20:36
  • Good luck then, I'll check this topic sometimes, because it interested me) – AdamSkywalker Jun 06 '15 at 20:40
  • Hello, the 3rd major GC was even longer (more than 3 seconds). I edited my question with 3rd major GC log. I'm now restarting my server using these flags: -XX:GCTimeRatio=19, -XX:+PerfDisableSharedMem, -XX:+ParallelRefProcEnabled . I hope I'll get better results. – user4341206 Jun 08 '15 at 08:59
  • @user4341206 Looks like restarting server is an option :) still better than previous 15 seconds. Let's see what we'll have next. – AdamSkywalker Jun 08 '15 at 09:07
  • We'll see. I have one more question for you, in case you know the answer. If I use the -XX:-ClassUnloadingWithConcurrentMark flag to turn off class unloading after the G1 concurrent mark, in which phase is class unloading executed then? I've searched but haven't found an answer yet. @the8472 recommended this flag to me to decrease the unloading time in the remark phase (which, as you can see, was the longest part of the last GC remark). – user4341206 Jun 08 '15 at 09:49
1

You seem to be concerned about pause times, yet I do not see any pause time goals in your JVM flags.

You can set pause time goals via MaxGCMinorPauseMillis (minor collections only) and MaxGCPauseMillis (overall).

Those are just goals; the collector may fail to meet them for various reasons. By default the GC is only allowed to burn 1% of your CPU time on GCing, so if it fails to meet its goals you can try setting GCTimeRatio=19 to grant it 5% of the time instead (exchanging throughput for pause times).
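Translated into flags, that could look roughly like this (the millisecond values are placeholders, not recommendations; the GC time budget works out to roughly 1/(1+GCTimeRatio)):

-XX:MaxGCPauseMillis=200       (overall pause-time goal, placeholder value)
-XX:MaxGCMinorPauseMillis=50   (minor-collection pause goal, placeholder value)
-XX:GCTimeRatio=19             (allows GC roughly 1/(1+19) = 5% of CPU time)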


The longest pause in your logs seems to be this one:

2015-06-05T09:14:51.909+0200: 317539.613: [GC remark 317539.613: [Finalize Marking, 0.0003625 secs] 317539.613: [GC ref-proc, 0.9890475 secs] 317540.602: [Unloading, 0.9739555 secs], 1.9643775 secs]

You can try bringing down the reference processing time with -XX:+ParallelRefProcEnabled.
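One simple way to check whether that flag helps (assuming the GC log is written to a file, here called gc.log) is to compare the ref-proc times reported in the remark entries before and after enabling it:

grep "GC ref-proc" gc.log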

the8472
  • 40,999
  • 5
  • 70
  • 122
  • I tried to set MaxGCPauseMillis with CMS but it didn't help (there were still long pauses)... With G1GC, MaxGCPauseMillis defaults to 200ms, so I removed the MaxGCPauseMillis flag because I want to use that default setting. I'll try GCTimeRatio=19; I hadn't heard about this option until now. – user4341206 Jun 06 '15 at 20:24
  • Thanks, I'll use this flag also. This should decrease ref-proc time if I understand correctly - what about class unloading time, is there a flag to decrease that time too? – user4341206 Jun 07 '15 at 10:54
  • The things discussed in the linked question regarding class unloading in CMS apply here too. – the8472 Jun 07 '15 at 11:06
  • I came across this article: http://www.evanjones.ca/jvm-mmap-pause.html. The author says that adding the -XX:+PerfDisableSharedMem flag should stop real time from being much higher than user+sys time (I occasionally see similar symptoms). But I also read that this flag could cause some performance issues. Do you have any experience with setting this flag? – user4341206 Jun 08 '15 at 06:03