
Basics: Oracle Java 8, 8-CPU server, 4 GB heap.

We have a performance problem with our server and have narrowed it down to GC. At some point GC performance suddenly tanks, and "Ext Root Scanning" is the culprit. With 1 GC thread the total time for this phase is around 400 ms; with 2 threads it is 800 ms (so still ~400 ms real time), and with 8 threads it is 3200 ms (still ~400 ms real time).

This suggests to me there is some fixed cost of roughly 400 ms per worker thread: the work is not being divided, each thread pays the same ~400 ms. That fixed cost is the problem I need to get to the bottom of.
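For correlating the onset of the extra ~400 ms with application events, each pause can also be timestamped from inside the JVM rather than by parsing the GC log; here is a minimal sketch using the HotSpot-specific com.sun.management notification API (the class name is just illustrative):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import javax.management.NotificationEmitter;
    import javax.management.openmbean.CompositeData;
    import com.sun.management.GarbageCollectionNotificationInfo;

    public class GcPauseLog {
        // Call once at start-up to print every GC pause with a timestamp.
        public static void install() {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                ((NotificationEmitter) gc).addNotificationListener((notification, handback) -> {
                    if (GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                            .equals(notification.getType())) {
                        GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                                .from((CompositeData) notification.getUserData());
                        // Duration is the whole pause, not the Ext Root Scanning sub-phase,
                        // but it is enough to see exactly when the jump starts.
                        System.out.printf("%tT %s / %s: %d ms%n", System.currentTimeMillis(),
                                info.getGcName(), info.getGcCause(), info.getGcInfo().getDuration());
                    }
                }, null, null);
            }
        }
    }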

Extract from the GC log; the first GC is normal, the one after it is abnormal:

2020-07-28T20:22:47.119+0200: 398.095: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 127926272 bytes, new threshold 15 (max 15)
- age   1:   36475752 bytes,   36475752 total
- age   2:    4016576 bytes,   40492328 total
- age   3:    2422456 bytes,   42914784 total
- age   6:       1936 bytes,   42916720 total
- age   7:     478904 bytes,   43395624 total
, 0.1002403 secs]
   [Parallel Time: 67.9 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 398096.0, Avg: 398096.1, Max: 398096.1, Diff: 0.1]
      [Ext Root Scanning (ms): Min: 7.7, Avg: 10.9, Max: 23.3, Diff: 15.6, Sum: 86.8]
      [Update RS (ms): Min: 0.0, Avg: 11.6, Max: 14.5, Diff: 14.5, Sum: 93.2]
         [Processed Buffers: Min: 0, Avg: 36.5, Max: 55, Diff: 55, Sum: 292]
      [Scan RS (ms): Min: 0.1, Avg: 0.2, Max: 0.3, Diff: 0.1, Sum: 1.8]
      [Code Root Scanning (ms): Min: 0.5, Avg: 0.8, Max: 1.2, Diff: 0.7, Sum: 6.8]
      [Object Copy (ms): Min: 43.2, Avg: 44.1, Max: 44.8, Diff: 1.6, Sum: 352.4]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
         [Termination Attempts: Min: 1, Avg: 70.3, Max: 87, Diff: 86, Sum: 562]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.6]
      [GC Worker Total (ms): Min: 67.7, Avg: 67.7, Max: 67.8, Diff: 0.1, Sum: 541.8]
      [GC Worker End (ms): Min: 398163.8, Avg: 398163.8, Max: 398163.9, Diff: 0.1]
   [Code Root Fixup: 0.3 ms]
   [Code Root Purge: 0.1 ms]
   [Clear CT: 0.6 ms]
   [Other: 31.3 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 26.8 ms]
      [Ref Enq: 1.0 ms]
      [Redirty Cards: 0.2 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.1 ms]
      [Free CSet: 2.4 ms]
   [Eden: 1885.0M(1885.0M)->0.0B(1854.0M) Survivors: 66.0M->118.0M Heap: 2861.0M(3356.0M)->1026.8M(3356.0M)]
 [Times: user=0.59 sys=0.00, real=0.11 secs] 
2020-07-28T20:23:43.857+0200: 454.820: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 129499136 bytes, new threshold 15 (max 15)
- age   1:   60550712 bytes,   60550712 total
- age   2:   35224048 bytes,   95774760 total
- age   3:    4015112 bytes,   99789872 total
- age   4:    2420472 bytes,  102210344 total
- age   7:       1856 bytes,  102212200 total
- age   8:     478904 bytes,  102691104 total
, 0.4842469 secs]
   [Parallel Time: 458.1 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 454820.7, Avg: 454820.8, Max: 454820.9, Diff: 0.1]
      [Ext Root Scanning (ms): Min: 403.6, Avg: 404.2, Max: 406.3, Diff: 2.8, Sum: 3233.9]
      [Update RS (ms): Min: 8.5, Avg: 10.1, Max: 10.4, Diff: 1.9, Sum: 81.0]
         [Processed Buffers: Min: 25, Avg: 33.3, Max: 51, Diff: 26, Sum: 266]
      [Scan RS (ms): Min: 0.2, Avg: 0.3, Max: 0.4, Diff: 0.2, Sum: 2.1]
      [Code Root Scanning (ms): Min: 0.2, Avg: 1.1, Max: 2.1, Diff: 1.9, Sum: 8.9]
      [Object Copy (ms): Min: 41.2, Avg: 42.1, Max: 43.2, Diff: 2.0, Sum: 336.8]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
         [Termination Attempts: Min: 1, Avg: 68.8, Max: 91, Diff: 90, Sum: 550]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.6]
      [GC Worker Total (ms): Min: 457.9, Avg: 457.9, Max: 458.0, Diff: 0.2, Sum: 3663.5]
      [GC Worker End (ms): Min: 455278.7, Avg: 455278.7, Max: 455278.8, Diff: 0.1]
   [Code Root Fixup: 0.4 ms]
   [Code Root Purge: 0.1 ms]
   [Clear CT: 0.7 ms]
   [Other: 25.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 21.4 ms]
      [Ref Enq: 0.7 ms]
      [Redirty Cards: 0.2 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.1 ms]
      [Free CSet: 1.9 ms]
   [Eden: 1854.0M(1854.0M)->0.0B(56.0M) Survivors: 118.0M->111.0M Heap: 2880.8M(3356.0M)->1019.8M(3356.0M)]
 [Times: user=0.92 sys=0.00, real=0.48 secs] 

Any suggestions on what this could be or how to diagnose it?

  • The time taken to stop all threads is not the cause (I enabled the flags to measure this, then disabled them again).
  • I am unable to reproduce this on a different server with the same application and GC settings but different load (see the sketch after this list for checking the effective settings).
  • The slow GC times do not start immediately, but around 5 minutes after application start-up, possibly coinciding with a burst of incoming CXF traffic.
  • EVERY GC after this point has the same problem.
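To double-check the "same application and GC settings" point across the two servers, the effective JVM arguments (including any -javaagent entries) can be dumped at runtime; a minimal sketch using the standard RuntimeMXBean (the class name is just illustrative):

    import java.lang.management.ManagementFactory;
    import java.lang.management.RuntimeMXBean;

    public class JvmArgsDump {
        public static void main(String[] args) {
            RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
            System.out.println(runtime.getVmName() + " " + runtime.getVmVersion());
            // The arguments the JVM actually received: GC flags, -javaagent entries, etc.
            for (String arg : runtime.getInputArguments()) {
                System.out.println(arg);
            }
        }
    }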
Dom54
  • This is an interesting issue – MartinByers Jul 29 '20 at 12:09
  • At least in jdk-15 this phase has some [inner sub-phases](https://hg.openjdk.java.net/jdk/jdk15/file/0368f3a073a9/src/hotspot/share/gc/g1/g1GCPhaseTimes.cpp#l408) that can be seen [here](https://hg.openjdk.java.net/jdk/jdk15/file/0368f3a073a9/src/hotspot/share/gc/g1/g1GCPhaseTimes.hpp#l49). The good thing is that if you enable trace for the GC, you will be able to see a lot more output. The 5-minute delay might be an indication of JIT activity, btw. – Eugene Jul 29 '20 at 13:25
  • Tentatively, we have determined the cause to be a popular application monitoring tool. With the agent enabled this happens; without it, no problem... to be continued. – Dom54 Aug 03 '20 at 07:24
  • Why not post this as an answer or add these details to the question? This sounds too interesting. I would gladly look further if you provide details. – Eugene Aug 11 '20 at 02:30
  • In the end I was unable to prove anything either way (tried it in a different environment, env number 3, and couldn't reproduce it). It's got to be something VMware- or AV-related. – Dom54 Oct 05 '20 at 18:39

0 Answers