
We have been struggling with what appear to be long stop-the-world pauses when using the G1 collector. I have read through the Oracle documentation, but I am still having trouble determining what is causing the long pauses and what to do about them. (GC logs below.)

Our instance is being monitored and I have information contained in the following graphs:

Graphs: a 12-second stop-the-world pause, and heap allocations over time.

We have another monitoring tool that pings the JVM, and it reported the JVM was unresponsive for 12 seconds at around the exact same time.

So that brings me to the question of what to do about this. The load on the server is very low, so this doesn't happen often, but it seems that over several hours the heap just grows and grows and then there is a huge GC event that can cause serious issues. Below is the configuration we are using for the GC:

   wrapper.java.additional.40=-XX:+UseG1GC

   wrapper.java.additional.44=-XX:+ScavengeBeforeFullGC


   wrapper.java.additional.50=-XX:+PrintGCCause
   wrapper.java.additional.51=-XX:+PrintGCDetails
   wrapper.java.additional.52=-XX:+PrintGCTimeStamps
   wrapper.java.additional.53=-XX:+PrintGCApplicationStoppedTime
   wrapper.java.additional.54=-XX:+PrintGCApplicationConcurrentTime
   wrapper.java.additional.55=-verbose:gc
   wrapper.java.additional.56=-Xloggc:../../../logs/gc.log
   wrapper.java.additional.57=-XX:+UseGCLogFileRotation
   wrapper.java.additional.58=-XX:NumberOfGCLogFiles=10
   wrapper.java.additional.59=-XX:GCLogFileSize=100M
   wrapper.java.additional.60=-XX:+PrintHeapAtGC
   wrapper.java.additional.61=-XX:+PrintTenuringDistribution
   wrapper.java.additional.62=-XX:+UseCompressedClassPointers
   wrapper.java.additional.63=-XX:+UseCompressedOops

Can anyone point me in the right direction here? Thanks!

GCEasy Analysis: http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMDYvMjcvLS1nYyAoMSkubG9nLnppcC0tMTgtNDEtNDA=

Update: Metaspace graph


Update: GC Logs: https://dl.dropboxusercontent.com/u/3642047/gc.log.zip

Casey Jordan

1 Answer


I see at least two problems in your GC logs:

  1. There were 4 full collections invoked explicitly by System.gc() or Runtime.getRuntime().gc(). Each took around 10 seconds:

    277042.600: [Full GC (System.gc())  12G->1537M(5126M), 11.4203806 secs]
    

    You may want to add the -XX:+ExplicitGCInvokesConcurrent JVM flag to prevent System.gc() from causing stop-the-world full collections. (A small test program demonstrating the flag is sketched after this list.)

    It will also be useful to find out who calls System.gc() and, ideally, avoid the call altogether. For this purpose you may modify the Runtime.gc method by adding code that prints a stack trace. Then recompile Runtime.java and prepend the modified class to the bootstrap classpath with
    -Xbootclasspath/p:/path/to/yourpatch.jar

  2. Another problem is an extremely long safepoint synchronization pause not related to GC:

    5512447.686: Total time for which application threads were stopped: 16.4426008 seconds, Stopping threads took: 16.4414390 seconds
    

    This is typically caused by MappedByteBuffer I/O or by the Java process starting to swap to disk. See this and this answer regarding similar problems.
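Regarding the first item: the following is a minimal, hypothetical test program (SystemGcProbe is not part of your application) for verifying what -XX:+ExplicitGCInvokesConcurrent changes. Run it with and without the flag and compare the GC log lines produced by the explicit System.gc() call:

    // SystemGcProbe.java -- hypothetical probe, assumes JDK 8 with the G1 collector.
    // Run twice and compare the GC logs:
    //   java -XX:+UseG1GC -verbose:gc SystemGcProbe
    //       -> logs "Full GC (System.gc())", a stop-the-world full collection
    //   java -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -verbose:gc SystemGcProbe
    //       -> System.gc() starts a concurrent G1 cycle instead of a full STW collection
    public class SystemGcProbe {
        public static void main(String[] args) {
            byte[][] junk = new byte[1024][];
            for (int i = 0; i < junk.length; i++) {
                junk[i] = new byte[64 * 1024];   // allocate some soon-to-be garbage
            }
            junk = null;                         // make it unreachable
            System.gc();                         // explicit GC request, like the ones in your logs
        }
    }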

apangin
  • Awesome observations. The System.gc() calls should be easy to fix if we can track down which libraries are calling it. We also use a library that does a lot with memory-mapped files, sometimes 70 GB or more. Any suggestions on what to tune so the OS handles that better, besides forcing it all into RAM? Thanks! – Casey Jordan Jul 13 '17 at 16:23
  • @CaseyJordan Unfortunately you can't do much about it other than ensuring the file is in RAM, replacing mapped I/O with regular FileChannel I/O, or disabling Unsafe intrinsics in the JVM (the latter may significantly decrease throughput). – apangin Jul 15 '17 at 08:03
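Following up on the comment above about replacing mapped I/O: below is a rough sketch (not the library's actual code; the file path and chunk size are assumptions) of streaming a large file with FileChannel.read() into a reusable buffer instead of FileChannel.map(). The page faults then happen inside the blocking native read() call rather than while a Java thread is touching a MappedByteBuffer, which is one known way time-to-safepoint gets stretched:

    // FileChannelReadSketch.java -- hypothetical example of plain FileChannel I/O.
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class FileChannelReadSketch {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
                ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);  // 1 MiB read buffer
                long total = 0;
                while (ch.read(buf) != -1) {                          // blocking read in native code
                    buf.flip();
                    total += buf.remaining();                         // process the chunk here
                    buf.clear();
                }
                System.out.println("Read " + total + " bytes");
            }
        }
    }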