
I am experiencing infrequent long GC pauses (30+ seconds) with the G1GC collector, roughly once a month. When it happens, I restart my service and the delay does not recur for about another month.

I am attaching the GC log:

2016-06-15T02:30:17.407-0400: 1702455.276: Total time for which application threads were stopped: 0.2900480 seconds
2016-06-15T02:30:25.509-0400: 1702463.378: Total time for which application threads were stopped: 0.0018900 seconds
2016-06-15T02:30:29.817-0400: 1702467.686: [GC pause (young)
Desired survivor size 7340032 bytes, new threshold 15 (max 15)
- age   1:    1903912 bytes,    1903912 total
- age   2:     641440 bytes,    2545352 total
- age   3:     455856 bytes,    3001208 total
- age   4:     563544 bytes,    3564752 total
- age   5:    1873368 bytes,    5438120 total
- age   6:     326024 bytes,    5764144 total
- age   7:     299144 bytes,    6063288 total
- age   8:     304632 bytes,    6367920 total
- age   9:     273160 bytes,    6641080 total
- age  10:     309152 bytes,    6950232 total
- age  11:     156824 bytes,    7107056 total
- age  12:     135064 bytes,    7242120 total
, 0.1199490 secs]
   [Parallel Time: 118.2 ms, GC Workers: 4]
      [GC Worker Start (ms): Min: 1702467686.7, Avg: 1702467686.8, Max: 1702467686.9, Diff: 0.2]
      [Ext Root Scanning (ms): Min: 18.4, Avg: 20.2, Max: 24.6, Diff: 6.2, Sum: 80.7]
      [Update RS (ms): Min: 75.5, Avg: 79.5, Max: 80.8, Diff: 5.3, Sum: 317.8]
         [Processed Buffers: Min: 52, Avg: 61.0, Max: 70, Diff: 18, Sum: 244]
      [Scan RS (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.3]
      [Object Copy (ms): Min: 17.7, Avg: 18.1, Max: 18.6, Diff: 0.8, Sum: 72.3]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [GC Worker Other (ms): Min: 0.1, Avg: 0.1, Max: 0.1, Diff: 0.0, Sum: 0.4]
      [GC Worker Total (ms): Min: 117.8, Avg: 117.9, Max: 118.0, Diff: 0.2, Sum: 471.6]
      [GC Worker End (ms): Min: 1702467804.7, Avg: 1702467804.7, Max: 1702467804.7, Diff: 0.0]
   [Code Root Fixup: 0.0 ms]
   [Clear CT: 0.2 ms]
   [Other: 1.6 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.3 ms]
      [Ref Enq: 0.0 ms]
      [Free CSet: 0.4 ms]
   [Eden: 100.0M(100.0M)->0.0B(100.0M) Survivors: 9216.0K->9216.0K Heap: 2164.6M(2198.0M)->2065.5M(2198.0M)]
 [Times: user=0.48 sys=0.00, real=37.62 secs]
2016-06-15T02:31:07.438-0400: 1702505.307: Total time for which application threads were stopped: 37.6223210 seconds

2016-06-15T02:31:07.443-0400: 1702505.312: Total time for which application threads were stopped: 0.0034690 seconds
2016-06-15T02:31:07.446-0400: 1702505.315: Total time for which application threads were stopped: 0.0018140 seconds
2016-06-15T02:31:07.451-0400: 1702505.320: Total time for which application threads were stopped: 0.0020130 seconds
2016-06-15T02:31:07.453-0400: 1702505.322: Total time for which application threads were stopped: 0.0015850 seconds

I am using the below G1GC flags:

-server -d64 -verbose:gc -Xms2g -Xmx4g -XX:+UseG1GC 
-XX:MaxGCPauseMillis=1500 -XX:G1HeapRegionSize=2 
-XX:+PrintFlagsFinal -XX:ParallelGCThreads=4 -XX:ConcGCThreads=2 
-XX:PermSize=64m -XX:MaxPermSize=256m -XX:+DisableExplicitGC 
-XX:+PrintGCDateStamps -XX:+DoEscapeAnalysis -XX:+PrintTenuringDistribution 
-XX:+PrintGCTimeStamps -XX:+PrintGCDetails 
-XX:+PrintGCApplicationStoppedTime -XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=$PP_HOME -XX:+UseCompressedOops -XX:+AggressiveOpts

Machine configuration:

Red Hat Enterprise Linux Server release 5.11 
Dual core CPU
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

How can we find the root cause of the long GC pause from the above statements?

Ravindra babu
  • I very much doubt that you can find the root cause from this information alone. Instead, you have to enable GC tracing and study its output files. – GhostCat Jun 17 '16 at 11:56
  • What is the switch to enable tracing? – Ravindra babu Jun 17 '16 at 15:27
  • Looks like the GC itself isn't taking much CPU time and only the wall time is long for some reason; you should enable safepoint logging (a flag sketch follows these comments), see http://stackoverflow.com/a/33293324/1362755 – the8472 Jun 18 '16 at 01:27
  • Is there any overhead in enabling safepoint logging? – Ravindra babu Jun 18 '16 at 08:45
  • In addition, you should use a newer Java version (at least Java 8) when using G1. At least, that is our experience: G1 on Java 7 just was not production ready. Things got MUCH better in the latest Java 8u60+ for our purposes. So maybe that is a good shot for you. – Dominik Sandjaja Oct 19 '16 at 09:06
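
For reference, the safepoint logging mentioned above can be enabled with flags along these lines; this is a minimal sketch assuming a standard HotSpot 7 build, so verify the names against the -XX:+PrintFlagsFinal output already being generated:

-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-XX:+PrintGCApplicationStoppedTime

PrintSafepointStatistics reports which VM operation triggered each safepoint and how long threads took to reach it, PrintSafepointStatisticsCount=1 prints the statistics after every safepoint instead of batching them, and PrintGCApplicationStoppedTime is already present in the flag set above. The overhead is generally negligible beyond the extra log volume.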

2 Answers


Update to the latest version of Java 8 - you should not be using G1 with Java 7, because it was not production ready at that time. Odds are very good you have hit a now-patched bug. Yes, there are ways to get OpenJDK 8 onto a RHEL 5 box. Yes, your application will run much faster.
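
If you do upgrade, note that the PermGen flags in the question no longer apply on Java 8 (PermGen was removed); a rough sketch of the substitution, with the sizes simply carried over as assumptions to re-tune:

-XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=256m

Java 8 ignores -XX:PermSize and -XX:MaxPermSize with a startup warning, so leaving them in is harmless but noisy.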

Also, this:

-XX:MaxGCPauseMillis=1500

and this:

-Xms2g -Xmx4g

would suggest you aren't running in the sort of large-heap, low-latency environment G1 is designed for. With a JRE that old, you may be better off with the default collector (ParallelGC), possibly tuning a few options to reduce pause times (mileage may vary), or with CMS, which was more mature at the time and, with tuning, can be quite stable (a rough flag sketch follows).
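
Purely as an illustration of what the CMS route might look like on this JDK 7 build, not a tuned recommendation; the occupancy threshold is an assumption to adjust against your own heap behaviour:

-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly

UseCMSInitiatingOccupancyOnly makes CMS start its concurrent cycle at the given old-generation occupancy instead of relying on its own heuristics, which makes behaviour more predictable. The ParallelGC default needs no extra flags beyond perhaps -XX:MaxGCPauseMillis as a hint to its adaptive sizing.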

And for goodness' sake, if your app is critical enough that a 30-second GC pause once a month is a problem, then you can probably afford a system with RHEL 7 and more than two cores.

BobMcGee

The real time here is 37 seconds, whereas the user time is 0.48 seconds and kernel time is 0 seconds. That is, it took all of the GC threads 37 seconds of wall-clock time to use 0.48 seconds of CPU time. In other words, the GC threads are severely starved for CPU. Noisy neighbour? One way to check is sketched below.
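
As a hedged way to test the noisy-neighbour theory, assuming the box is virtualized and standard RHEL tooling is available, watch the hypervisor steal time during the problem window:

vmstat 5     # "st" column: CPU time stolen by the hypervisor; sustained non-zero values support the theory
sar -u       # %steal per interval from the daily data, if the sysstat package is installed

A consistently high steal percentage around the time of the 37-second pause would point at contention on the host rather than at the JVM itself.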