3

So, we're using G1 GC. and 18GB heap. Young generation size is about 3,5G. And heap maximum usage is about 12G. And memory is full of short-living objects.

Also, maybe it's important that Couchbase instance is running on the same node near the JVM. It takes all iops often when it's persisting changes on hdd, but there are enough free cpu time and memory.

Enabled JVM options:

-Xmx18g -Xms18g -XX:MaxPermSize=512M -XX:+UseG1GC -XX:+DisableExplicitGC -XX:+ParallelRefProcEnabled

My sad problem is young generation GC pauses. Long pause GC log usually looks like:

2013-07-10T15:06:25.963+0400: 9122,066: [GC pause (young)
Desired survivor size 243269632 bytes, new threshold 5 (max 15)
        - age   1:   69789280 bytes,   69789280 total
- age   2:   58618240 bytes,  128407520 total
- age   3:   54519720 bytes,  182927240 total
- age   4:   51592728 bytes,  234519968 total
- age   5:   45186520 bytes,  279706488 total
9122,066: [G1Ergonomics (CSet Construction) start choosing CSet, predicted base time: 174,16 ms, remaining time: 25,84 ms, target pause time: 200,00 ms]
        9122,066: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 426 regions, survivors: 34 regions, predicted young region time: 164,97 ms]
        9122,066: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 426 regions, survivors: 34 regions, old: 0 regions, predicted pause time: 339,13 ms, target pause time: 200,00 ms]
        9122,259: [SoftReference, 0 refs, 0,0063780 secs]9124,575: [WeakReference, 4 refs, 0,0031600 secs]9124,578: [FinalReference, 1640 refs, 0,0033730 secs]9124,581: [PhantomReference, 145 refs, 0,0032080 secs]9124,5
        85: [JNI Weak Reference, 0,0000810 secs], 2,53669600 secs]
        [Parallel Time: 190,5 ms]
        [GC Worker Start (ms):  9122066,6  9122066,7  9122066,7  9122066,7  9122066,8  9122066,9  9122066,9  9122066,9  9122066,9  9122067,0  9122067,0  9122067,1  9122067,1  9122067,1  9122067,1  9122067,2  91220
        67,2  9122067,3
Avg: 9122067,0, Min: 9122066,6, Max: 9122067,3, Diff:   0,7]
        [Ext Root Scanning (ms):  4,7  6,0  4,8  4,5  4,2  4,3  4,2  4,3  4,6  3,4  13,5  5,2  4,2  5,6  4,2  4,1  4,3  4,0
Avg:   5,0, Min:   3,4, Max:  13,5, Diff:  10,1]
        [Update RS (ms):  20,9  19,6  21,1  21,3  21,2  21,2  21,3  21,2  21,7  21,5  12,1  20,2  21,1  19,4  21,0  21,1  20,7  21,2
Avg:  20,4, Min:  12,1, Max:  21,7, Diff:   9,6]
        [Processed Buffers : 27 23 25 29 31 22 25 34 28 14 36 23 24 22 28 24 25 24
Sum: 464, Avg: 25, Min: 14, Max: 36, Diff: 22]
        [Scan RS (ms):  9,0  9,2  8,7  8,8  9,1  9,1  8,9  9,1  8,3  9,2  9,0  9,1  9,2  9,2  9,1  9,0  9,0  9,1
Avg:   9,0, Min:   8,3, Max:   9,2, Diff:   1,0]
        [Object Copy (ms):  145,1  145,0  145,2  145,1  145,1  144,9  145,1  144,9  144,9  145,4  144,8  144,8  144,8  145,0  145,0  145,1  145,2  144,9
Avg: 145,0, Min: 144,8, Max: 145,4, Diff:   0,6]
        [Termination (ms):  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0
Avg:   0,0, Min:   0,0, Max:   0,0, Diff:   0,0]
        [Termination Attempts : 5 8 2 11 5 6 6 5 5 7 4 7 2 9 8 5 7 8
Sum: 110, Avg: 6, Min: 2, Max: 11, Diff: 9]
        [GC Worker End (ms):  9122246,4  9122246,6  9122246,7  9122246,6  9122246,7  9122246,7  9122246,5  9122246,7  9122246,5  9122246,5  9122246,6  9122246,7  9122246,8  9122246,4  9122246,6  9122246,5  9122246
        ,7  9122246,8
Avg: 9122246,6, Min: 9122246,4, Max: 9122246,8, Diff:   0,3]
        [GC Worker (ms):  179,8  179,9  180,0  179,8  179,9  179,9  179,6  179,8  179,6  179,5  179,6  179,6  179,7  179,3  179,5  179,4  179,4  179,5
Avg: 179,7, Min: 179,3, Max: 180,0, Diff:   0,7]
        [GC Worker Other (ms):  10,7  10,7  10,8  10,8  10,9  10,9  11,0  11,0  11,0  11,1  11,1  11,1  11,2  11,2  11,2  11,2  11,3  11,3
Avg:  11,0, Min:  10,7, Max:  11,3, Diff:   0,6]
        [Clear CT:   2,8 ms]
        [Other: 2343,4 ms]
        [Choose CSet:   0,1 ms]
        [Ref Proc: 2327,7 ms]
        [Ref Enq:   1,9 ms]
        [Free CSet:   8,2 ms]
        [Eden: 3408M(3408M)->0B(3400M) Survivors: 272M->280M Heap: 9998M(18432M)->6638M(18432M)]
        [Times: user=3,26 sys=0,02, real=2,54 secs]
Total time for which application threads were stopped: 2,5434370 seconds

The only phase of GC which makes problems is 'Reference processing'. But the log looks strange: soft, weak, final and jni references processing took very little time. But overall time is 2,5 seconds. Or it can be even more, up to 10 seconds in worst cases.

Another pause (more comfortable) may look like:

2013-07-10T16:26:11.862+0400: 13907,965: [GC pause (young)
    Desired survivor size 243269632 bytes, new threshold 4 (max 15)
    - age   1:   69125832 bytes,   69125832 total
    - age   2:   58756480 bytes,  127882312 total
    - age   3:   52397376 bytes,  180279688 total
    - age   4:   88850424 bytes,  269130112 total
    13907,965: [G1Ergonomics (CSet Construction) start choosing CSet, predicted base time: 77,38 ms, remaining time: 122,62 ms, target pause time: 200,00 ms]
    13907,965: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 427 regions, survivors: 33 regions, predicted young region time: 167,95 ms]
    13907,965: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 427 regions, survivors: 33 regions, old: 0 regions, predicted pause time: 245,33 ms, target pause time: 200,00 ms]
    13908,155: [SoftReference, 0 refs, 0,0041340 secs]13908,160: [WeakReference, 0 refs, 0,0023850 secs]13908,162: [FinalReference, 1393 refs, 0,0065970 secs]13908,169: [PhantomReference, 108 refs, 0,0018650 secs]13908,171: [JNI Weak Reference, 0,0000630 secs], 0,22008100 secs]
    [Parallel Time: 188,4 ms]
    [GC Worker Start (ms):  13907965,3  13907965,3  13907965,4  13907965,4  13907965,5  13907965,5  13907965,6  13907965,6  13907965,6  13907965,7  13907965,7  13907965,7  13907965,8  13907965,8  13907965,8  13907965,9  13907965,9  13907965,9
    Avg: 13907965,6, Min: 13907965,3, Max: 13907965,9, Diff:   0,6]
    [Ext Root Scanning (ms):  5,8  5,0  6,8  6,3  6,1  6,2  6,0  6,3  5,2  4,2  5,0  6,2  4,5  6,0  17,1  4,4  6,2  5,3
    Avg:   6,3, Min:   4,2, Max:  17,1, Diff:  12,9]
    [Update RS (ms):  24,8  26,0  23,9  24,1  24,1  24,1  24,2  23,9  25,0  25,2  25,1  24,1  26,0  24,3  13,7  25,7  24,2  24,7
    Avg:  24,1, Min:  13,7, Max:  26,0, Diff:  12,2]
    [Processed Buffers : 30 20 9 16 16 19 20 21 22 12 30 17 17 20 12 20 17 22
    Sum: 340, Avg: 18, Min: 9, Max: 30, Diff: 21]
    [Scan RS (ms):  7,5  7,1  7,2  7,5  7,6  7,5  7,5  7,6  7,1  7,4  7,6  7,2  7,2  7,4  7,2  7,5  7,0  7,7
    Avg:   7,4, Min:   7,0, Max:   7,7, Diff:   0,7]
    [Object Copy (ms):  133,1  133,1  133,2  133,1  133,2  133,1  133,2  133,1  133,5  134,0  133,0  133,2  133,0  132,9  132,6  133,1  133,2  132,9
    Avg: 133,1, Min: 132,6, Max: 134,0, Diff:   1,3]
    [Termination (ms):  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0  0,0
    Avg:   0,0, Min:   0,0, Max:   0,0, Diff:   0,0]
    [Termination Attempts : 1 3 1 1 1 1 1 1 1 1 2 3 2 1 1 1 1 1
    Sum: 24, Avg: 1, Min: 1, Max: 3, Diff: 2]
    [GC Worker End (ms):  13908136,6  13908136,9  13908136,5  13908136,7  13908136,7  13908136,8  13908136,7  13908136,7  13908136,8  13908136,8  13908136,5  13908136,6  13908136,5  13908136,5  13908136,5  13908136,5  13908136,8  13908136,6
    Avg: 13908136,7, Min: 13908136,5, Max: 13908136,9, Diff:   0,4]
    [GC Worker (ms):  171,3  171,6  171,1  171,2  171,2  171,3  171,1  171,1  171,1  171,2  170,8  170,9  170,7  170,7  170,7  170,6  171,0  170,7
    Avg: 171,0, Min: 170,6, Max: 171,6, Diff:   0,9]
    [GC Worker Other (ms):  17,2  17,2  17,3  17,3  17,4  17,4  17,5  17,5  17,5  17,5  17,6  17,6  17,7  17,7  17,7  17,7  17,8  17,8
    Avg:  17,5, Min:  17,2, Max:  17,8, Diff:   0,6]
    [Clear CT:   1,6 ms]
    [Other:  30,1 ms]
    [Choose CSet:   0,1 ms]
    [Ref Proc:  17,1 ms]
    [Ref Enq:   0,9 ms]
    [Free CSet:   7,4 ms]
    [Eden: 3416M(3416M)->0B(3456M) Survivors: 264M->224M Heap: 7289M(18432M)->3912M(18432M)]
    [Times: user=3,16 sys=0,00, real=0,22 secs]    

Reference processing is still longest phase, but it's much shorter. ParallelRefProcEnabled wasn't be a cure for my problem. I've also tried to change size of young gen. It also not helped. Setting different -XX:MaxGCPauseMillis, more relaxed 600ms or more strict 100, results in still bad throughput.

CMS performance is even worse than G1 with parameters:

-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly

Young gen collections going longer and more often.

I'm totally confused with described logs. Tomorrow I'll try to move Couchbase instance to another node in order to check if it's freezes GC threads.

But, if Couchbase is not the point, maybe someone could explain me meaning of the logs. Or maybe there are some magic CMS parameters to fix that thing.

I'll be very glad for any help!

Anton Maximov
  • 251
  • 2
  • 6
  • Do you have GC logs from CMS collector with +XX:+PrintReferenceGC option enabled? – Alexey Ragozin Jul 10 '13 at 20:59
  • Yeah, sure. This part of log is about it: 13908,155: [SoftReference, 0 refs, 0,0041340 secs]13908,160: [WeakReference, 0 refs, 0,0023850 secs]13908,162: [FinalReference, 1393 refs, 0,0065970 secs]13908,169: [PhantomReference, 108 refs, 0,0018650 secs]13908,171: [JNI Weak Reference, 0,0000630 secs], 0,22008100 secs] [Parallel Time: 188,4 ms] – Anton Maximov Jul 11 '13 at 04:45
  • 1
    Actually it seems we've fixed our problem by moving couchbase to another machine. Now 99% of pauses are lesser than 0,25 sec. But we still have about 2000 of final links on every 0,1-0,2 sec GC pause. Maybe I should do something with that to make it even better. – Anton Maximov Jul 11 '13 at 04:50
  • Sounds like you push the memory limit when you have JVM and CouchBase on the same machine? – Nat Oct 26 '17 at 19:33

1 Answers1

2

Problem was fixed by ourselves. I we've got a new rule - NEVER install Couchbase near JVM. In the past we had such problem with instance Postgresql conflicting with Couchbase because Couchbase likes to grab all disk ops and Postgresql cannot commit anything.

So, isolate Couchbase and everything gonna be allright.

Anton Maximov
  • 251
  • 2
  • 6