Long G1GC pauses in Cassandra causing dropped mutations

Question

I am running a Cassandra 3.0.11 cluster with 3 DCs and 10 nodes.

I frequently see the following messages:

WARN  [Service Thread] 2021-02-10 14:03:10,219  GCInspector.java:282 - G1 Young Generation GC in 1317ms.  G1 Eden Space: 4546625536 -> 0; G1 Old Gen: 22573336584 -> 25140250632; G1 Survivor Space: 1124073472 -> 721420288;
WARN  [Service Thread] 2021-02-10 14:03:11,916  GCInspector.java:282 - G1 Young Generation GC in 1382ms.  G1 Eden Space: 989855744 -> 0; G1 Old Gen: 25140250632 -> 26364987400; G1 Survivor Space: 721420288 -> 218103808;
WARN  [Service Thread] 2021-02-10 14:03:49,801  GCInspector.java:282 - G1 Young Generation GC in 1072ms.  G1 Eden Space: 4496293888 -> 0; G1 Old Gen: 17078798632 -> 19586992416; G1 Survivor Space: 620756992 -> 654311424;
WARN  [Service Thread] 2021-02-10 14:03:51,471  GCInspector.java:282 - G1 Young Generation GC in 1336ms.  G1 Eden Space: 1056964608 -> 0; G1 Old Gen: 19586992416 -> 20870449448; G1 Survivor Space: 654311424 -> 218103808;
WARN  [Service Thread] 2021-02-10 14:04:42,262  GCInspector.java:282 - G1 Young Generation GC in 8909ms.  G1 Eden Space: 1493172224 -> 0; G1 Old Gen: 32195070248 -> 34099284256;
WARN  [Service Thread] 2021-02-10 14:04:44,990  GCInspector.java:282 - G1 Young Generation GC in 2520ms.  G1 Old Gen: 34099284256 -> 34317388064; G1 Survivor Space: 218103808 -> 0;
WARN  [Service Thread] 2021-02-10 14:04:47,245  GCInspector.java:282 - G1 Old Generation GC in 28836ms.  G1 Old Gen: 34317388064 -> 11666582136; Metaspace: 49839232 -> 49835448
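To make the pattern in these lines easier to see, here is a minimal sketch that extracts the pause time and the change in Old Gen occupancy from each GCInspector line. It assumes the log format is exactly as pasted above; the function names `parse_gc_line` and `old_gen_delta` are hypothetical helpers written for this post, not part of Cassandra.

```python
import re

# Assumption: lines look exactly like the GCInspector WARN lines pasted above.
LINE_RE = re.compile(r"(G1 \w+ Generation) GC in (\d+)ms\.")
POOL_RE = re.compile(
    r"(G1 Eden Space|G1 Old Gen|G1 Survivor Space|Metaspace): (\d+) -> (\d+)"
)


def parse_gc_line(line):
    """Return collector name, pause in ms, and per-pool (before, after) bytes."""
    match = LINE_RE.search(line)
    if match is None:
        return None  # not a GCInspector pause line
    pools = {
        name: (int(before), int(after))
        for name, before, after in POOL_RE.findall(line)
    }
    return {
        "collector": match.group(1),
        "pause_ms": int(match.group(2)),
        "pools": pools,
    }


def old_gen_delta(event):
    """Bytes added to (positive) or freed from (negative) the old generation."""
    before, after = event["pools"].get("G1 Old Gen", (0, 0))
    return after - before


# Two of the lines from the log above, used as sample input.
SAMPLE = [
    "WARN  [Service Thread] 2021-02-10 14:03:10,219  GCInspector.java:282 - "
    "G1 Young Generation GC in 1317ms.  G1 Eden Space: 4546625536 -> 0; "
    "G1 Old Gen: 22573336584 -> 25140250632; "
    "G1 Survivor Space: 1124073472 -> 721420288;",
    "WARN  [Service Thread] 2021-02-10 14:04:47,245  GCInspector.java:282 - "
    "G1 Old Generation GC in 28836ms.  G1 Old Gen: 34317388064 -> 11666582136; "
    "Metaspace: 49839232 -> 49835448",
]

for raw in SAMPLE:
    event = parse_gc_line(raw)
    print(event["collector"], event["pause_ms"], "ms, old gen delta:",
          old_gen_delta(event), "bytes")
```

Run against the full log, this shows the old generation growing by roughly 1 to 2.5 GB on each young collection until the 28-second old generation collection frees about 22 GB.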

I am using G1GC with a 32 GB heap. Because of these pauses, I often see dropped mutations:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0     1747789164         0                 0
ViewMutationStage                 0         0              0         0                 0
ReadStage                         0         0       12399767         0                 0
RequestResponseStage              0         0      627930907         0                 0
ReadRepairStage                   0         0          60775         0                 0
CounterMutationStage              0         0              0         0                 0
MiscStage                         0         0              0         0                 0
CompactionExecutor                0         0        2101437         0                 0
MemtableReclaimMemory             0         0           4381         0                 0
PendingRangeCalculator            0         0             66         0                 0
GossipStage                       0         0        1350977         0                 0
SecondaryIndexManagement          0         0              0         0                 0
HintsDispatcher                   0         0          11394         0                 0
MigrationStage                    0         0         207917         0                 0
MemtablePostFlush                 0         0           3667         0                 0
ValidationExecutor                0         0              0         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               0         0           2926         0                 0
InternalResponseStage             0         0         420120         0                 0
AntiEntropyStage                  0         0              0         0                 0
CacheCleanupExecutor              0         0              0         0                 0
Native-Transport-Requests         3         0     3503749628         0          12323589

Message type           Dropped
READ                     66919
RANGE_SLICE               8260
_TRACE                       0
HINT                   2208871
MUTATION               5207285
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE         16491
PAGED_RANGE                  0
READ_REPAIR                  9

I have tried the sjk tool, and the threads I most often see at the top are SharedPool-Worker threads:

Monitoring threads ...
2021-02-10T14:01:27.672-0700 Process summary
  process cpu=355.30%
  application cpu=362.09% (user=322.78% sys=39.31%)
  other: cpu=-6.79%
  thread count: 823
  heap allocation rate 1168mb/s
[000642] user=26.57% sys= 0.86% alloc=  119mb/s - SharedPool-Worker-10
[000647] user=23.41% sys= 0.93% alloc=  115mb/s - SharedPool-Worker-12
[000636] user=25.83% sys= 2.34% alloc=  111mb/s - SharedPool-Worker-4
[000634] user=20.25% sys= 0.27% alloc=  100mb/s - SharedPool-Worker-2
[000652] user=19.14% sys= 0.17% alloc=   99mb/s - SharedPool-Worker-19
[000648] user=19.14% sys= 0.19% alloc=   98mb/s - SharedPool-Worker-16
[000637] user=21.00% sys= 0.25% alloc=   94mb/s - SharedPool-Worker-5
[000633] user=12.82% sys= 2.51% alloc=   32mb/s - SharedPool-Worker-1
[000654] user= 7.25% sys= 0.76% alloc=   31mb/s - SharedPool-Worker-20

What is the best way to find out what is filling up the heap and triggering these GCs?

Update: CPU info

 ~]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             32
NUMA node(s):          32
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 61
Model name:            Intel Core Processor (Broadwell, IBRS)
Stepping:              2
CPU MHz:               2095.320
BogoMIPS:              4190.64
Virtualization:        VT-x
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K

Total RAM

176GB

Clients

 sudo netstat | grep 9042 | grep ESTABLISHED| wc -l
295

You need to provide more information about the hardware, how much memory is allocated to the heap, etc. Plus how many clients, the number of transactions per second, etc. — Alex Ott, Feb 11 '21 at 07:51
Hello @AlexOtt, I have updated the question with the info, minus transactions per second. Is there a way to quickly get that? — developthou, Feb 11 '21 at 21:13
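
One rough way to estimate a per-node request rate, sketched under assumptions: the Completed column of `nodetool tpstats` is a cumulative counter, so capturing the output twice and dividing the difference by the interval gives an approximate rate per pool. The helper names `completed_counts` and `rates_per_second` are hypothetical, and the parser assumes the column layout shown in the tpstats output above.

```python
def completed_counts(tpstats_text):
    """Map pool name -> cumulative Completed count from tpstats text."""
    counts = {}
    for line in tpstats_text.splitlines():
        parts = line.split()
        # Pool rows have six columns: name, active, pending, completed,
        # blocked, all-time blocked. Header and "Dropped" rows do not match.
        if len(parts) == 6 and all(p.isdigit() for p in parts[1:]):
            counts[parts[0]] = int(parts[3])
    return counts


def rates_per_second(first, second, interval_s):
    """Approximate completed tasks per second for pools present in both samples."""
    return {
        pool: (second[pool] - first[pool]) / interval_s
        for pool in second
        if pool in first
    }
```

For example, capture `nodetool tpstats` into two strings 60 seconds apart and call `rates_per_second(completed_counts(a), completed_counts(b), 60)`. The MutationStage figure would then approximate writes applied on that node per second, replica writes included, so it is not the same as client transactions per second for the whole cluster.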
