
The gc.log shows that the Prepare TLABs phase took about 57 seconds, which is unacceptable. Furthermore, this situation has only happened once in five days. I just want to figure out what actually happened and how to avoid it.

[gc.log]

[2021-08-02T11:38:38.134+0800][322490.377s][161325][safepoint     ] Entering safepoint region: G1CollectForAllocation
[2021-08-02T11:38:38.134+0800][322490.378s][161325][gc,start      ] GC(238) Pause Young (Normal) (G1 Evacuation Pause)
[2021-08-02T11:38:38.134+0800][322490.378s][161325][gc,task       ] GC(238) Using 18 workers of 18 for evacuation
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)   Pre Evacuate Collection Set: 0.1ms
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)     Prepare TLABs: 57039.1ms
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)     Choose Collection Set: 0.0ms
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)     Humongous Register: 0.1ms
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)   Evacuate Collection Set: 3.8ms
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)     Ext Root Scanning (ms):   Min:  0.0, Avg:  0.4, Max:  3.5, Diff:  3.5, Sum:  7.2, Workers: 18
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)     Update RS (ms):           Min:  0.0, Avg:  0.5, Max:  0.9, Diff:  0.9, Sum:  8.7, Workers: 18
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)       Processed Buffers:        Min: 0, Avg:  2.9, Max: 16, Diff: 16, Sum: 53, Workers: 18
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)       Scanned Cards:            Min: 0, Avg: 265.7, Max: 623, Diff: 623, Sum: 4783, Workers: 18
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)       Skipped Cards:            Min: 0, Avg: 20.4, Max: 32, Diff: 32, Sum: 368, Workers: 18
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)     Scan RS (ms):             Min:  0.0, Avg:  0.2, Max:  0.3, Diff:  0.3, Sum:  3.2, Workers: 18
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)       Scanned Cards:            Min: 0, Avg: 27.3, Max: 120, Diff: 120, Sum: 491, Workers: 18
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)       Claimed Cards:            Min: 0, Avg: 29.9, Max: 129, Diff: 129, Sum: 538, Workers: 18
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)       Skipped Cards:            Min: 0, Avg: 187.6, Max: 347, Diff: 347, Sum: 3377, Workers: 18
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)     Code Root Scanning (ms):  Min:  0.0, Avg:  0.0, Max:  0.0, Diff:  0.0, Sum:  0.1, Workers: 18
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)     AOT Root Scanning (ms):   skipped
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)     Object Copy (ms):         Min:  0.0, Avg:  2.0, Max:  2.6, Diff:  2.6, Sum: 35.7, Workers: 18
[2021-08-02T11:39:35.179+0800][322547.422s][161325][gc,phases     ] GC(238)     Termination (ms):         Min:  0.0, Avg:  0.4, Max:  0.6, Diff:  0.6, Sum:  7.4, Workers: 18
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)       Termination Attempts:     Min: 1, Avg:  4.8, Max: 9, Diff: 8, Sum: 86, Workers: 18
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)     GC Worker Other (ms):     Min:  0.0, Avg:  0.1, Max:  0.3, Diff:  0.2, Sum:  2.2, Workers: 18
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)     GC Worker Total (ms):     Min:  3.4, Avg:  3.6, Max:  3.7, Diff:  0.3, Sum: 64.7, Workers: 18
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)   Post Evacuate Collection Set: 1.0ms
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)     Code Roots Fixup: 0.0ms
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)     Clear Card Table: 0.2ms
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)     Reference Processing: 0.1ms
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)     Weak Processing: 0.1ms
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)     Merge Per-Thread State: 0.1ms
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)     Code Roots Purge: 0.0ms
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)     Redirty Cards: 0.1ms
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)     DerivedPointerTable Update: 0.0ms
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)     Free Collection Set: 0.3ms
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)     Humongous Reclaim: 0.1ms
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)     Start New Collection Set: 0.0ms
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)     Resize TLABs: 0.0ms
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)     Expand Heap After Collection: 0.0ms
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,phases     ] GC(238)   Other: 57039.6ms
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,heap       ] GC(238) Eden regions: 865->0(867)
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,heap       ] GC(238) Survivor regions: 27->25(112)
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,heap       ] GC(238) Old regions: 20->20
[2021-08-02T11:39:35.179+0800][322547.423s][161325][gc,heap       ] GC(238) Humongous regions: 5->5
[2021-08-02T11:39:35.180+0800][322547.423s][161325][gc,metaspace  ] GC(238) Metaspace: 45664K->45664K(1091584K)
[2021-08-02T11:39:35.180+0800][322547.423s][161325][gc            ] GC(238) Pause Young (Normal) (G1 Evacuation Pause) 916M->48M(1488M) 57045.332ms
[2021-08-02T11:39:35.180+0800][322547.423s][161325][gc,cpu        ] GC(238) User=0.00s Sys=0.09s Real=57.05s
[2021-08-02T11:39:35.180+0800][322547.423s][161325][safepoint     ] Leaving safepoint region
[2021-08-02T11:39:35.180+0800][322547.423s][161325][safepoint     ] Total time for which application threads were stopped: 57.0457609 seconds, Stopping threads took: 0.0000407 seconds
    There seems to be a related bug which is not public, unfortunately. I found [this indirect reference](https://hotspot-gc-dev.openjdk.java.narkive.com/2C2uA4QQ/rfr-xxs-jdk-8012525-gc-metaspace-g1addmetaspacedependency-java-test-fails-a-safepoint-timeout) mentioning the bug (`ensure_parsability` is what is executed in the “Prepare TLABs” phase). – Holger Aug 05 '21 at 09:05
  • @Holger Thanks. According to the reference, the slow GC pause was highly related to a small value of G1UpdateBufferSize (which was 1). But I used the default value, which is 256. Should I set a bigger value of **G1UpdateBufferSize**? Is there a better way to avoid this situation? – JasonYu Aug 06 '21 at 01:39
  • I seriously doubt you will get an answer here. My advice is to set `-Xlog:gc+tlab=trace` and go to the gc-dev mailing list. – Eugene Aug 06 '21 at 03:09
  • If the problem happened just once, I suppose it has nothing to do with TLABs at all. It's likely just a random moment when the JVM process was suspended by the OS. For example, a common reason for freezes is when the OS runs out of free physical memory. – apangin Aug 08 '21 at 19:10
  • @apangin That makes sense too, although it sounds hard to avoid completely. Is there any system log from Linux that can help us confirm this? On the other hand, will strictly constraining the memory usage of the process help? – JasonYu Aug 09 '21 at 00:42

1 Answer

In my opinion, the GC log alone is not enough to find the root cause.

Some suggestions:

  1. Add `-Xlog:gc+tlab=debug` to show TLAB-related logs (a sample command line follows this list).
  2. Pay attention to the log lines starting with "TLAB totals: thrds:"; refer to the TLAB source code in threadLocalAllocBuffer.cpp.
  3. Compare them with the GC logs to analyse possible problems.
  4. If the log shows a large delay between "TLAB totals: thrds:" and "Pre Evacuate Collection Set", dive deeper into the "TLAB totals: thrds:" log and find the cause (the thread count is too large, newly created threads each allocated some new objects, ...).
  5. If the log shows nearly no delay between "TLAB totals: thrds:" and "Pre Evacuate Collection Set", switch the log level to `-Xlog:gc*=debug` to find more info.
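
A sample command line for enabling the logging suggested above (the application jar name and log file path are placeholders; the decorators simply mirror the ones already visible in the gc.log above):

```
# Hypothetical launch command: turns on general GC and TLAB debug logging.
# Adjust the jar name, log path, and decorators as needed.
java -Xlog:gc*=debug,gc+tlab=debug:file=gc.log:time,uptime,tid,tags -jar app.jar
```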

Some suspicions:

  1. Maybe there were a lot of newly created threads and each of them created at least one object. That could make the TLAB prepare phase slow, because there would be a lot of TLABs to retire by filling them with dummy objects (see the sketch below).
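
To illustrate that suspicion, here is a minimal, hypothetical reproducer (not part of the original answer; the thread count, allocation size, and sleep time are arbitrary): it spawns many threads that each allocate one small object, so every thread owns a live TLAB that the next GC pause has to retire during the "Prepare TLABs" phase.

```java
import java.util.ArrayList;
import java.util.List;

public class ManyTlabThreads {
    // Volatile sink so the small allocations are not optimized away.
    static volatile Object sink;

    public static void main(String[] args) throws InterruptedException {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {       // arbitrary, deliberately large count; reduce if OS thread limits are hit
            Thread t = new Thread(() -> {
                sink = new byte[64];              // one small allocation gives this thread its own TLAB
                try {
                    Thread.sleep(60_000);         // keep the thread (and its TLAB) alive across a GC
                } catch (InterruptedException ignored) {
                }
            });
            t.start();
            threads.add(t);
        }
        System.gc();                              // provoke a collection while all the TLABs are live
        for (Thread t : threads) {
            t.join();
        }
    }
}
```

Running this with the `-Xlog:gc+tlab=debug` flag from the suggestions above should make the "TLAB totals: thrds:" line grow with the number of threads that have allocated since the previous pause.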
Hash Jang