
I have the same issue described in Why is the JVM suddenly shrinking eden space and enlarging old space? (causing performance issues), with the Tomcat 8.5.73 image running the following OpenJDK version:

openjdk 11.0.13 2021-10-19
OpenJDK Runtime Environment 18.9 (build 11.0.13+8)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8, mixed mode, sharing)

The cluster is deployed to Azure Kubernetes Service (AKS).

Here are the JVM args I use:

-Xms19500M -Xmx19500M -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:MaxGCPauseMillis=400 -XX:+UnlockExperimentalVMOptions -XX:G1MaxNewSizePercent=80 -Xlog:gc*=info:file=/usr/local/tomcat/logs/gc.log:time,tags,level
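For reference, some quick arithmetic of my own, based on the 8 MB region size reported in the GC log below:

19504 MB heap / 8 MB per region ≈ 2438 regions
G1MaxNewSizePercent=80 → young generation may grow up to ~0.80 × 2438 ≈ 1950 regions

This matches GC(36) below, where Eden (1893 regions) plus Survivor (57 regions) totals exactly 1950.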

I was able to reproduce the issue several times during performance tests: 1 pod out of 4 had its Eden space shrink and its Old Gen grow dramatically, almost simultaneously, together with a very sharp increase in GC pause time, as shown in the following screenshots (the unhealthy pod is shown in yellow):

Below is the GC log giving more detail about the issue when it occurred:

[2022-01-25T19:07:59.419+0000][info][gc,heap] Heap region size: 8M
[2022-01-25T19:08:04.696+0000][info][gc     ] Using G1
[2022-01-25T19:08:04.696+0000][info][gc,heap,coops] Heap address: 0x000000033d000000, size: 19504 MB, Compressed Oops mode: Zero based, Oop shift amount: 3
[2022-01-25T19:08:04.697+0000][info][gc,cds       ] Mark closed archive regions in map: [0x00000007bf800000, 0x00000007bf868ff8]
[2022-01-25T19:08:04.697+0000][info][gc,cds       ] Mark open archive regions in map: [0x00000007bf700000, 0x00000007bf746ff8]
[...]
[2022-01-25T19:23:57.286+0000][info][gc,start      ] GC(36) Pause Young (Normal) (G1 Evacuation Pause)
[2022-01-25T19:23:57.287+0000][info][gc,task       ] GC(36) Using 11 workers of 11 for evacuation
[2022-01-25T19:23:57.366+0000][info][gc,phases     ] GC(36)   Pre Evacuate Collection Set: 0.2ms
[2022-01-25T19:23:57.366+0000][info][gc,phases     ] GC(36)   Evacuate Collection Set: 73.3ms
[2022-01-25T19:23:57.366+0000][info][gc,phases     ] GC(36)   Post Evacuate Collection Set: 5.7ms
[2022-01-25T19:23:57.366+0000][info][gc,phases     ] GC(36)   Other: 0.6ms
[2022-01-25T19:23:57.366+0000][info][gc,heap       ] GC(36) Eden regions: 1893->0(1898)
[2022-01-25T19:23:57.366+0000][info][gc,heap       ] GC(36) Survivor regions: 57->52(244)
[2022-01-25T19:23:57.366+0000][info][gc,heap       ] GC(36) Old regions: 64->65
[2022-01-25T19:23:57.366+0000][info][gc,heap       ] GC(36) Humongous regions: 0->0
[2022-01-25T19:23:57.366+0000][info][gc,metaspace  ] GC(36) Metaspace: 114532K->114532K(1153024K)
[2022-01-25T19:23:57.366+0000][info][gc            ] GC(36) Pause Young (Normal) (G1 Evacuation Pause) 16098M->921M(19504M) 79.885ms
[2022-01-25T19:23:57.366+0000][info][gc,cpu        ] GC(36) User=0.86s Sys=0.00s Real=0.08s
[2022-01-25T19:24:12.305+0000][info][gc,start      ] GC(37) Pause Young (Normal) (G1 Evacuation Pause)
[2022-01-25T19:24:12.305+0000][info][gc,task       ] GC(37) Using 11 workers of 11 for evacuation
[2022-01-25T19:24:12.376+0000][info][gc,phases     ] GC(37)   Pre Evacuate Collection Set: 0.2ms
[2022-01-25T19:24:12.376+0000][info][gc,phases     ] GC(37)   Evacuate Collection Set: 63.7ms
[2022-01-25T19:24:12.376+0000][info][gc,phases     ] GC(37)   Post Evacuate Collection Set: 6.2ms
[2022-01-25T19:24:12.376+0000][info][gc,phases     ] GC(37)   Other: 0.6ms
[2022-01-25T19:24:12.376+0000][info][gc,heap       ] GC(37) Eden regions: 1898->0(80)
[2022-01-25T19:24:12.376+0000][info][gc,heap       ] GC(37) Survivor regions: 52->41(244)
[2022-01-25T19:24:12.376+0000][info][gc,heap       ] GC(37) Old regions: 65->65
[2022-01-25T19:24:12.376+0000][info][gc,heap       ] GC(37) Humongous regions: 0->0
[2022-01-25T19:24:12.376+0000][info][gc,metaspace  ] GC(37) Metaspace: 114534K->114534K(1153024K)
[2022-01-25T19:24:12.376+0000][info][gc            ] GC(37) Pause Young (Normal) (G1 Evacuation Pause) 16105M->834M(19504M) 70.736ms
[2022-01-25T19:24:12.376+0000][info][gc,cpu        ] GC(37) User=0.75s Sys=0.00s Real=0.07s
[2022-01-25T19:24:12.980+0000][info][gc,start      ] GC(38) Pause Young (Normal) (G1 Evacuation Pause)
[2022-01-25T19:24:12.980+0000][info][gc,task       ] GC(38) Using 11 workers of 11 for evacuation
[2022-01-25T19:24:13.087+0000][info][gc,phases     ] GC(38)   Pre Evacuate Collection Set: 0.2ms
[2022-01-25T19:24:13.087+0000][info][gc,phases     ] GC(38)   Evacuate Collection Set: 104.9ms
[2022-01-25T19:24:13.087+0000][info][gc,phases     ] GC(38)   Post Evacuate Collection Set: 1.3ms
[2022-01-25T19:24:13.087+0000][info][gc,phases     ] GC(38)   Other: 0.5ms
[2022-01-25T19:24:13.087+0000][info][gc,heap       ] GC(38) Eden regions: 80->0(1933)
[2022-01-25T19:24:13.087+0000][info][gc,heap       ] GC(38) Survivor regions: 41->16(16)
[2022-01-25T19:24:13.087+0000][info][gc,heap       ] GC(38) Old regions: 65->111
[2022-01-25T19:24:13.087+0000][info][gc,heap       ] GC(38) Humongous regions: 0->0
[2022-01-25T19:24:13.087+0000][info][gc,metaspace  ] GC(38) Metaspace: 114534K->114534K(1153024K)
[2022-01-25T19:24:13.087+0000][info][gc            ] GC(38) Pause Young (Normal) (G1 Evacuation Pause) 1474M->1006M(19504M) 107.010ms
[2022-01-25T19:24:13.087+0000][info][gc,cpu        ] GC(38) User=1.16s Sys=0.00s Real=0.10s
[2022-01-25T19:24:13.857+0000][info][gc,start      ] GC(39) Pause Young (Normal) (G1 Evacuation Pause)
[2022-01-25T19:24:13.857+0000][info][gc,task       ] GC(39) Using 11 workers of 11 for evacuation
[2022-01-25T19:24:14.026+0000][info][gc,phases     ] GC(39)   Pre Evacuate Collection Set: 0.2ms
[2022-01-25T19:24:14.026+0000][info][gc,phases     ] GC(39)   Evacuate Collection Set: 166.8ms
[2022-01-25T19:24:14.026+0000][info][gc,phases     ] GC(39)   Post Evacuate Collection Set: 1.5ms
[2022-01-25T19:24:14.026+0000][info][gc,phases     ] GC(39)   Other: 0.5ms
[2022-01-25T19:24:14.026+0000][info][gc,heap       ] GC(39) Eden regions: 105->0(105)
[2022-01-25T19:24:14.026+0000][info][gc,heap       ] GC(39) Survivor regions: 16->16(16)
[2022-01-25T19:24:14.026+0000][info][gc,heap       ] GC(39) Old regions: 111->188
[2022-01-25T19:24:14.026+0000][info][gc,heap       ] GC(39) Humongous regions: 0->0
[2022-01-25T19:24:14.026+0000][info][gc,metaspace  ] GC(39) Metaspace: 114535K->114535K(1153024K)
[2022-01-25T19:24:14.026+0000][info][gc            ] GC(39) Pause Young (Normal) (G1 Evacuation Pause) 1846M->1623M(19504M) 169.080ms
[2022-01-25T19:24:14.026+0000][info][gc,cpu        ] GC(39) User=1.83s Sys=0.00s Real=0.17s

The adjusted/target Eden size shrinks dramatically to 80 regions in GC(37):

[2022-01-25T19:24:12.376+0000][info][gc,heap       ] GC(37) Eden regions: 1898->0(80)
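In absolute terms (again my arithmetic, using the 8 MB region size): 1898 × 8 MB ≈ 14.8 GB of Eden target before, versus 80 × 8 MB = 640 MB after, i.e. about 96% of the Eden capacity removed in a single ergonomic decision.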

One second later, on GC(38), the adjusted/target Eden size is reverted, but the number of Old Gen regions starts to increase:

[2022-01-25T19:24:13.087+0000][info][gc,heap       ] GC(38) Eden regions: 80->0(1933)
[...]
[2022-01-25T19:24:13.087+0000][info][gc,heap       ] GC(38) Old regions: 65->111
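That is 46 newly filled Old regions (65 → 111), i.e. 46 × 8 MB ≈ 368 MB promoted by a single young GC, whereas the steady state before the incident was about one region per collection (64 → 65 in GC(36)).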

Another second later, on GC(39), Eden is shrunk again, probably because not many objects were allocated during that time frame:

[2022-01-25T19:24:14.026+0000][info][gc,heap       ] GC(39) Eden regions: 105->0(105)
[2022-01-25T19:24:14.026+0000][info][gc,heap       ] GC(39) Survivor regions: 16->16(16)
[2022-01-25T19:24:14.026+0000][info][gc,heap       ] GC(39) Old regions: 111->188
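Note that Survivor is pinned at its cap here (16 → 16, out of a maximum of 16), so whatever survives the collection but does not fit in Survivor space is promoted straight to Old, which presumably explains why Old jumps again by 77 regions (111 → 188, ≈ 616 MB).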

From that point on, the GC interval drops to about 1 s, whereas it was about 15 s before the incident (and still is on the healthy pods), which explains the very long GC pause times.
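Some rough math, assuming the allocation rate stayed roughly constant: before the incident, ~1893 Eden regions (≈ 14.8 GB) filled in ~15 s, i.e. an allocation rate of about 1 GB/s. With the Eden target capped at 80-105 regions (≈ 0.6-0.8 GB), a young collection has to fire roughly every 0.7-0.9 s, which is exactly what the GC(38)/GC(39) timestamps show.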

I also noticed that a few minutes before the problem occurred, GC pause time for the healthy pods was in the [70-100] ms range, while it was [100-160] ms for the unhealthy pod:

  • GC Pause healthy pod
  • GC Pause unhealthy pod: we can also see that on the 2 GCs right before GC(37), the GC pause time decreased back under 100 ms. Maybe G1 wrongly computed/predicted its target/adjusted Eden size because of that.

I have a few questions:

  • What G1 parameters could I try to avoid this situation, more specifically to better drive the target/adjusted Eden size? (See the sketch right after this list.)
  • Which GC log selectors and levels should I enable to better understand and troubleshoot this? I've tried gc*=debug, but it dumps 100+ lines on each GC and it is quite hard for a layman like me to see what's going on :) (See the logging example after this list.)
  • Could it be a bug in OpenJDK?
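On the first question, the flag that bounds how far G1 ergonomics can shrink the young generation is the experimental G1NewSizePercent (default 5). Interestingly, the observed floor matches that default: in GC(37) and GC(39) the young generation bottoms out at 121 regions (Eden target plus Survivor), which is ≈ 5% of the 2438 regions. A sketch of what raising it could look like; the value 40 is purely illustrative, not a recommendation:

-XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=40 -XX:G1MaxNewSizePercent=80

On the second question, G1's sizing decisions are logged under the ergo tags of unified logging, so a selector that keeps the main log at info but traces only the ergonomics might be less noisy than gc*=debug (available tag sets are worth double-checking with java -Xlog:help on your exact build):

-Xlog:gc*=info,gc+ergo*=trace:file=/usr/local/tomcat/logs/gc.log:time,tags,level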
  • Could you provide the GC logs for GC(26)-GC(35) and GC(40)? Thank you. – ceragon Feb 01 '22 at 15:04
  • Unfortunately I don't have access to the full GC logs anymore. I managed to overcome the issue by setting `G1NewSizePercent` to a higher value. – guiliguili Feb 02 '22 at 17:34
