Understanding cause of G1 garbage collector's long pause in one specific GC-phase

Question

I am trying to G1 type garbage collector in a server-side JVM and while, in general, the gc-pauses are quite short, once in a while they shoot up and then we have issues.

I am giving below the break-up of one of such long pauses, which is overall taking 27.79 seconds, of which 23.923 seconds are going for the phase indicated by "Other" in "[Other: 23923.6 ms]".

If someone can throw some light on how I can understand further the cause of this particular GC cycle-phase or how I can improve the GC situation, which seems to be making majority of the time, I'll appreciate it very much.

[GC pause (G1 Evacuation Pause) (young)
Desired survivor size 524288 bytes, new threshold 0 (max 0)
, 25.0957244 secs]
     [Parallel Time: 1087.3 ms, GC Workers: 4]
   [GC Worker Start (ms): Min: 5050906.2, Avg: 5050906.4, Max: 5050906.5, Diff: 0.2]
   [Ext Root Scanning (ms): Min: 1.0, Avg: 190.7, Max: 415.2, Diff: 414.2, Sum: 762.8]
   [Update RS (ms): Min: 308.9, Avg: 565.0, Max: 650.3, Diff: 341.4, Sum: 2259.8]
      [Processed Buffers: Min: 56, Avg: 69.2, Max: 82, Diff: 26, Sum: 277]
   [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
   [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
   [Object Copy (ms): Min: 0.8, Avg: 310.7, Max: 414.8, Diff: 414.0, Sum: 1242.9]
   [Termination (ms): Min: 0.0, Avg: 15.6, Max: 20.8, Diff: 20.8, Sum: 62.4]
      [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 4]
   [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
   [GC Worker Total (ms): Min: 1066.9, Avg: 1082.0, Max: 1087.2, Diff: 20.3, Sum: 4328.2]
   [GC Worker End (ms): Min: 5051973.4, Avg: 5051988.4, Max: 5051993.4, Diff: 20.0]
[Code Root Fixup: 0.2 ms]
[Code Root Purge: 0.0 ms]
[String Dedup Fixup: 2.0 ms, GC Workers: 4]
   [Queue Fixup (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
   [Table Fixup (ms): Min: 0.1, Avg: 0.9, Max: 1.9, Diff: 1.8, Sum: 3.7]
[Clear CT: 82.6 ms]
[Other: 23923.6 ms]
   [Choose CSet: 0.0 ms]
   [Ref Proc: 415.2 ms]
   [Ref Enq: 0.0 ms]
   [Redirty Cards: 0.1 ms]
   [Humongous Register: 0.1 ms]
   [Humongous Reclaim: 0.1 ms]
   [Free CSet: 0.2 ms]
[Eden: 119.0M(119.0M)->0.0B(133.0M) Survivors: 0.0B->0.0B Heap: 572.3M(2391.0M)->456.2M(2666.0M)]
[Times: user=0.00 sys=1.29, real=27.79 secs]

the8472 · Answer 1 · 2018-06-26T18:32:48.650

1

[Times: user=0.00 sys=1.29, real=27.79 secs]

It spends practically no CPU cycles at all on garbage collection and all the time in the kernel and even more wall time, suggesting the process is blocked in the kernel doing things. The primary suspect is your system swapping.

Other potential causes:

the system being CPU-starved by some other process
IO being blocked by an idle harddrive having to spin up
quotas being exceeded - especially in containerized or VM environments
THP compactation - although this is unlikely as it normally manifests through higher kernel load

You can also try increasing GC log verbosity with -XX:G1LogLevel=fine/finer/finest and see if that points at anything particular, but the symptoms point at an system-level problem and nothing VM-specific.

edited Jun 26 '18 at 18:32

answered Jun 23 '18 at 06:04

the8472

40,999
5
70
122

Thanks for replying. What would be the best way to observe the extent of system swapping at that time, if you'd know please? – Roshan Jun 23 '18 at 16:04
`vmstat`. for general sysadmin questions you probably should ask on superuser.com – the8472 Jun 24 '18 at 06:15
@Roshan Swapping is too bad and it's pretty probable here (another reason might be CPU busy with other work). Get more memory for the process or decrease its size. – maaartinus Jun 25 '18 at 01:16
@maaartinus but it doesn't really seem short of memory yet, does it? It shows "Heap: 572.3M(2391.0M)->456.2M(2666.0M)", which means it's using about 1/4-15th of the current heap of 2 GB. And it's allocated only 2 GB out of 4 GB given to it. Or, you meant something else? – Roshan Jun 25 '18 at 06:46
@Roshan I'm just trying to guess, what's going on. Swapping or CPU busy or something else? Can you try `swapoff -a`? Note that some processes will get killed instead of swapping, but it's quick test. `+++` You're using `-Xmm2G` right? That's how big your heap may grow and that's fine. There's some more memory used by the JVM, but this is usually no problem. Are you using a lot of non-heap memory (used by JNI, `ByteBuffer`s, ...)? How did you give 4 GB to it? – maaartinus Jun 25 '18 at 13:45
@maaartinus I give it Xms as 2G and Xmx as 4G, so it starts off with heap 2G assigned to it that can grow to 4G, if needed. At the point where it took so long in my GC output, the heap used was like 1/4th of that initial 2G. So, I was trying to say that heap didn't seem to be issue and hence getting more memory probably wasn't hitting the cause. – Roshan Jun 25 '18 at 16:34
@Roshan With `-Xms2G -Xmx4G` you said the VM how much memory it should use for its heap. This doesn't mean that the OS can give it that much memory to it without swapping. Given `[Times: user=0.00 sys=1.29, real=27.79 secs]`, it'd bet, it was swapping, but I may be wrong. – maaartinus Jun 25 '18 at 16:48
1

@maaartinus I'll check for the swappiness next time long GC pauses occur, to be 100% sure, although I am pretty sure that the memory available is not an issue. I am trying to use 2 - 4 GB on a 15 GB machine and it doesn't have any other heavy processes running. – Roshan Jun 26 '18 at 01:52
added a few things you can investigate. anyway, this is all guessing, the data you provided doesn't yield much insights. – the8472 Jun 26 '18 at 18:33
Thanks @the8472. Just noticed your comment and edits. Will be spending more time on it this weekend. – Roshan Jun 29 '18 at 06:59

Understanding cause of G1 garbage collector's long pause in one specific GC-phase

1 Answers1