
We run Oracle Coherence on

Java HotSpot(TM) 64-Bit Server VM (25.92-b14) for linux-amd64 JRE (1.8.0_92-b14)

It triggers few full GCs but many minor GCs, most of them below 20 milliseconds. Very occasionally, we get a long-duration minor GC like this:

2020-02-04T23:23:11.589+0100: 190856.335: [GC (Allocation Failure) 2020-02-04T23:23:15.545+0100: 190860.291: [ParNew
Desired survivor size 178946048 bytes, new threshold 15 (max 15)
- age   1:    1020040 bytes,    1020040 total
- age   2:     178320 bytes,    1198360 total
- age   3:     242304 bytes,    1440664 total
- age   4:      74328 bytes,    1514992 total
- age   5:     173944 bytes,    1688936 total
- age   6:     116680 bytes,    1805616 total
- age   7:        744 bytes,    1806360 total
- age   8:         96 bytes,    1806456 total
- age  12:      56800 bytes,    1863256 total
- age  14:         96 bytes,    1863352 total
- age  15:      56960 bytes,    1920312 total
: 357855K->5553K(699072K), 0.0133748 secs] 668006K->315703K(5941952K), 3.9693349 secs] [Times: user=0.22 sys=0.01, real=3.97 secs]
2020-02-04T23:23:15.559+0100: 190860.305: Total time for which application threads were stopped: 3.9703317 seconds, Stopping threads took: 0.0001269 seconds

As you can see, the total wall-clock time was 3.97 seconds. This duration is also reported as the total GC time (3.9693349 secs) and as the time for which application threads were stopped, but not as the ParNew GC time, which was only 0.0133748 seconds. The reported Eden and survivor sizes were similar to those in faster GCs. Thanks to `-XX:+PrintSafepointStatistics`, the JVM printed this breakdown for the GC:
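As an illustration (not part of the original diagnosis), a minimal sketch in Java that flags GC log lines whose `real` time far exceeds `user+sys`, which is the suspicious pattern in the log above. The class name, threshold, and regular expression are my own assumptions, not anything from the JDK:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: flag GC log lines where real time greatly exceeds
// user+sys, a symptom of the JVM stalling on something outside the GC work
// itself (e.g. blocked I/O) rather than on CPU-bound collection.
public class GcPauseCheck {
    // Matches the "[Times: user=0.22 sys=0.01, real=3.97 secs]" suffix.
    private static final Pattern TIMES = Pattern.compile(
        "\\[Times: user=([0-9.]+) sys=([0-9.]+), real=([0-9.]+) secs\\]");

    // Returns true when real time is more than `factor` times user+sys.
    static boolean suspicious(String logLine, double factor) {
        Matcher m = TIMES.matcher(logLine);
        if (!m.find()) {
            return false;
        }
        double user = Double.parseDouble(m.group(1));
        double sys  = Double.parseDouble(m.group(2));
        double real = Double.parseDouble(m.group(3));
        return real > factor * (user + sys);
    }

    public static void main(String[] args) {
        String line = "3.9693349 secs] [Times: user=0.22 sys=0.01, real=3.97 secs]";
        // real=3.97 vs user+sys=0.23, well beyond a 5x threshold
        System.out.println(suspicious(line, 5.0)); // prints "true"
    }
}
```

On the line quoted in the question, `real` (3.97 s) is roughly 17 times `user+sys` (0.23 s), which is what such a check would surface.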

   vmop      [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop] page_trap_count
190852.438: GenCollectForAllocation   [  167    1     1  ]      [ 0     0     0     0    24    ]  0
190856.328: GenCollectForAllocation   [  167    1     1  ]      [ 0     0     0     0  3969    ]  0
190860.312: RevokeBias                [  167    1     2  ]      [ 0     0     0     0     0    ]  1
190860.312: RevokeBias                [  168    0     3  ]      [ 0     0     3     0     0    ]  0
190860.312: ThreadDump                [  168    0     0  ]      [ 0     0     0     0     2    ]  0

That is, most of the time was reported as vmop time for GenCollectForAllocation, which I understand to be the GC operation itself; hence, little time was spent reaching safepoint status. What could be the reason? Are there possible solutions?

Thanks in advance.

Edit (added later): We found a correlation between these long minor GCs and poor I/O performance, as reported by the number of processes waiting for I/O. This points to the problem explained in this great article: https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic. One of the symptoms of that problem is a large difference in the Times line between real and user+sys, which we have. We even considered a possible solution not mentioned in the article: giving up on printing GC logs altogether. In the end this was not needed, as a disk hardware change removed the I/O inefficiencies and solved the lengthy minor GC problem.
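For reference, the mitigation discussed in the linked article can be sketched with JDK 8 options like the following (the log path is illustrative; writing the GC log to a tmpfs-backed location such as `/dev/shm` means GC log writes cannot block on slow disk I/O). As noted above, this was ultimately unnecessary for us after the disk hardware change:

```
-Xloggc:/dev/shm/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=5
-XX:GCLogFileSize=20M
```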

Corral
  • to be honest, I don't understand the format of the logs too much; did you happen to truncate it somehow? But if it's really that much for the young generation, it sounds like you have a big heap with many references from old to young (remembered sets and/or card table rings a bell?). Anyway, you need to post verbose logs; it's impossible to tell otherwise what is going on. – Eugene Feb 22 '20 at 01:00
  • The main GC message might have an unusual look because the option `-XX:+PrintTenuringDistribution` printed the breakdown of survivor object memory per age in the middle of the usual GC message. If we removed it, it would look more familiar: `2020-02-04T23:23:11.589+0100: 190856.335: [GC (Allocation Failure) 2020-02-04T23:23:15.545+0100: 190860.291: [ParNew: 357855K->5553K(699072K), 0.0133748 secs] 668006K->315703K(5941952K), 3.9693349 secs] [Times: user=0.22 sys=0.01, real=3.97 secs]` – Corral Feb 24 '20 at 10:25
  • 1
    [there seem to be a lot of tasks done at cleanup phase](http://hg.openjdk.java.net/jdk/jdk13/file/03468b206457/src/hotspot/share/runtime/safepoint.cpp#l651), you might need to enable tracing to find out what is really going on. – Eugene Feb 24 '20 at 12:36
  • and btw `vmop` _is_ the safepoint request (it stands for `VM operation`), [here are some details](https://stackoverflow.com/questions/59832510/times-in-g1gc-logs/59849584#59849584), for example. – Eugene Feb 24 '20 at 15:00

0 Answers