We run Oracle Coherence on
Java HotSpot(TM) 64-Bit Server VM (25.92-b14) for linux-amd64 JRE (1.8.0_92-b14)
It triggers a few full GCs but many minor GCs, most of them below 20 milliseconds. Very occasionally, we get an unusually long minor GC like this one:
2020-02-04T23:23:11.589+0100: 190856.335: [GC (Allocation Failure) 2020-02-04T23:23:15.545+0100: 190860.291: [ParNew
Desired survivor size 178946048 bytes, new threshold 15 (max 15)
- age 1: 1020040 bytes, 1020040 total
- age 2: 178320 bytes, 1198360 total
- age 3: 242304 bytes, 1440664 total
- age 4: 74328 bytes, 1514992 total
- age 5: 173944 bytes, 1688936 total
- age 6: 116680 bytes, 1805616 total
- age 7: 744 bytes, 1806360 total
- age 8: 96 bytes, 1806456 total
- age 12: 56800 bytes, 1863256 total
- age 14: 96 bytes, 1863352 total
- age 15: 56960 bytes, 1920312 total
: 357855K->5553K(699072K), 0.0133748 secs] 668006K->315703K(5941952K), 3.9693349 secs] [Times: user=0.22 sys=0.01, real=3.97 secs]
2020-02-04T23:23:15.559+0100: 190860.305: Total time for which application threads were stopped: 3.9703317 seconds, Stopping threads took: 0.0001269 seconds
As you can see, the total clock time was 3.97 seconds. This duration is also reported in the total GC time, 3.9693349 secs, and in the time for which application threads were stopped, but not in the ParNew GC time, which was only 0.0133748 secs. The reported Eden and survivor sizes were similar to those of the faster GCs. Thanks to -XX:+PrintSafepointStatistics, the JVM printed this breakdown for the GC:
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
190852.438: GenCollectForAllocation [ 167 1 1 ] [ 0 0 0 0 24 ] 0
190856.328: GenCollectForAllocation [ 167 1 1 ] [ 0 0 0 0 3969 ] 0
190860.312: RevokeBias [ 167 1 2 ] [ 0 0 0 0 0 ] 1
190860.312: RevokeBias [ 168 0 3 ] [ 0 0 3 0 0 ] 0
190860.312: ThreadDump [ 168 0 0 ] [ 0 0 0 0 2 ] 0
That is, most of the time was reported as vmop time for GenCollectForAllocation, which I understand to be the GC operation itself, so very little time was spent reaching safepoint status. What could be the reason? Possible solutions?
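For reference, this is roughly the diagnostic flag combination that produces the GC and safepoint output shown above; our exact command line is not reproduced here, so take it as an assumed, illustrative set of standard HotSpot 8 flags:

-Xloggc:gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1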
Thanks in advance.
Update (this paragraph was added later): We found a correlation between these long minor GCs and poor I/O performance, as reported by the number of processes waiting for I/O. This points to the problem explained in this great article: https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic. One of the symptoms of that problem is a large gap in the Times between real and user+sys, which is exactly what we see. We even considered a possible solution not mentioned in the article: giving up on printing GC logs altogether. In the end that was not needed, as a disk hardware change removed the I/O inefficiencies, and that solved the lengthy minor GC problem.
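To spot that symptom without eyeballing the whole log, something along these lines can scan a HotSpot GC log for pauses where real is much larger than user+sys; the class name, log path, and threshold are our own choices for illustration, not part of the original setup:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: flag GC pauses whose wall-clock (real) time greatly exceeds
// the CPU time (user + sys), the symptom of GC threads blocking off-CPU,
// e.g. on disk writes of the GC log itself.
public class GcTimesScanner {
    private static final Pattern TIMES = Pattern.compile(
            "\\[Times: user=([0-9.]+) sys=([0-9.]+), real=([0-9.]+) secs\\]");

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "gc.log";   // assumed log location
        double ratioThreshold = 3.0;                           // flag when real > 3x (user + sys)

        for (String line : Files.readAllLines(Paths.get(path))) {
            Matcher m = TIMES.matcher(line);
            if (!m.find()) {
                continue;
            }
            double user = Double.parseDouble(m.group(1));
            double sys = Double.parseDouble(m.group(2));
            double real = Double.parseDouble(m.group(3));
            double cpu = user + sys;
            if (real > 0.1 && real > ratioThreshold * cpu) {
                System.out.printf("suspicious pause: real=%.2fs user+sys=%.2fs -> %s%n",
                        real, cpu, line.trim());
            }
        }
    }
}

Running it against the log above would report the 3.97 s pause, since real is far above user+sys (0.23 s).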