Recently, some of our servers have been crashing due to segfaults. Although I don't have a proven root cause, I do have a hunch that it relates to how our application is garbage collected, the GC tuning we've done, and the memory profile.
After investigating multiple occurrences of these crashes, I've identified a pattern from the point of view of the JVM:
- prior to crash, number of threads increases to an above-normal level
- prior to crash, the generally normal sawtooth pattern of overall heap usage goes away and the heap size grows without decreasing
- prior to crash, the heap's young generation is consistently low, and does not appear to resize or grow in usage
- prior to crash, the old generation grows to a size greater than any past old gen sizes, and does not appear to be cleaned up or collected
- the segfault always involves an active GC thread, specifically in copy_to_survivor_space
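To confirm this pattern outside of the GC logs, the thread count and per-pool heap usage can be sampled from inside the JVM. The following is only a minimal sketch using the standard java.lang.management API; the class name JvmPatternSampler and the one-second interval are my own placeholders, not something we currently run. With G1 on Java 8 I would expect the heap pools to be reported as "G1 Eden Space", "G1 Survivor Space" and "G1 Old Gen".

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: samples the live thread count and heap pool usage.
public class JvmPatternSampler {

    /** Builds one log line with the live thread count and usage of each heap pool. */
    static String sample() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder line = new StringBuilder();
        line.append("threads=").append(threads.getThreadCount());
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip Metaspace, code cache, and other non-heap pools
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            line.append(" | ").append(pool.getName())
                .append(" used=").append(usage.getUsed() >> 20).append("M")
                .append(" committed=").append(usage.getCommitted() >> 20).append("M");
        }
        return line.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Number of samples to print; defaults to 3 (an arbitrary choice).
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 3;
        for (int i = 0; i < samples; i++) {
            System.out.println(sample());
            Thread.sleep(1000);
        }
    }
}
```

Logged periodically next to the GC log, output like this should show whether the thread count and old generation really do climb together before a crash.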
While I don't see hard evidence of an out-of-memory condition, I believe we are indeed running out of heap space for the application. If G1GC cannot copy young objects to survivor space as part of evacuation or promotion, it seems to follow that it did not have sufficient space to do so. Analyzing the GC logs, I see almost nothing related to humongous objects, so I don't think they're taking up much space in the heap.
Looking at the memory profile, my hunch is that I should decrease InitiatingHeapOccupancyPercent from 70 to something closer to the default of 45 in order to trigger a collection cycle earlier. It seems to me, especially given the ever-growing size of the Old Gen, that a mixed/full GC needs to be triggered more often or at least earlier. How do I initiate a full/mixed collection?
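If I do change the threshold, I would want to verify which value the running JVM actually picked up. Below is a small sketch, assuming a HotSpot JVM where com.sun.management.HotSpotDiagnosticMXBean is available; the class name IhopCheck and the describe helper are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reports the effective value of a HotSpot VM flag.
public class IhopCheck {

    /** Returns "flag=value (origin: ...)" or a note if the flag cannot be read. */
    static String describe(String flag) {
        try {
            HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (hotspot == null) {
                return flag + ": HotSpot diagnostic bean not available";
            }
            VMOption option = hotspot.getVMOption(flag);
            // The origin distinguishes a default from a value set on the command line.
            return flag + "=" + option.getValue() + " (origin: " + option.getOrigin() + ")";
        } catch (IllegalArgumentException e) {
            // Thrown when the bean or the flag is unknown to this JVM.
            return flag + ": not recognized by this JVM";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("InitiatingHeapOccupancyPercent"));
    }
}
```

With the options listed under Detail, I would expect this to report 70 until the flag is changed and the servers are restarted.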
Based on the information provided, are there other thoughts or opinions on how I can trigger collection sooner? Am I misinterpreting the segfault message and heading down the wrong path? What else can I do to gather information that might enable me to address the root cause of the crashes?
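One more piece of information I could collect is how often each collector has actually run. This is a rough sketch using the standard GarbageCollectorMXBean API; the class name GcCounts is a placeholder. My understanding, which may be wrong, is that on Java 8 with G1 the beans are named "G1 Young Generation" and "G1 Old Generation", and that mixed collections are counted under the former while the latter only counts full GCs.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints cumulative collection counts and times per collector.
public class GcCounts {

    /** Returns one line per collector with its collection count and total time. */
    static String summary() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(" timeMs=").append(gc.getCollectionTime())
              .append(System.lineSeparator());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(summary());
    }
}
```

If the old-generation count stays flat while the old gen keeps growing, that would at least be consistent with my suspicion that nothing is cleaning it up.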
Detail
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f38aa2655f5, pid=6293, tid=0x00007f3894efe700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_162-b12) (build 1.8.0_162-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.162-b12 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x5c85f5] G1ParScanThreadState::copy_to_survivor_space(InCSetState, oopDesc*, markOopDesc*)+0x45
#
JVM Options:
-XX:MaxHeapSize=30g
-XX:MetaspaceSize=256m
-XX:MaxMetaspaceSize=512m
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=70
-XX:-OmitStackTraceInFastThrow
-XX:+AlwaysPreTouch
-XX:+UseStringDeduplication
-XX:+UseCompressedOops
-Xloggc:/usr/local/company/logs/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=100M
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCCause
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintReferenceGC
-XX:+PrintTenuringDistribution
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/usr/local/company/logs/heapdump_126960.hprof