We have a 6 node Cassandra Cluster under heavy utilization. We have been dealing a lot with garbage collector stop the world event, which can take up to 50 seconds in our nodes, in the meantime Cassandra Node is unresponsive, not even accepting new logins.
Extra details:
- Cassandra Version: 3.11
- Heap Size = 12 GB
- We are using G1 Garbage Collector with default settings
- Nodes size: 4 CPUs 28 GB RAM
- The G1 GC behavior is the same across all nodes.
Any help would be very much appreciated!
Edit 1:
Checking object creation stats, it does not look healthy at all.
Edit 2:
I have tried to use the suggested settings by Chris Lohfink, here is the GC report:
Using CMS suggested settings http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTAtNDk=
Using G1 suggested settings http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTExLTE3
The behavior remains basically the same:
- Old Gen starts to fill up.
- GC can't clean it properly without a full GC and a STW event.
- The full GC starts to take longer, until the node is completely unresponsive.
I'm going to get the cfstats output for maximum partition size and tombstones per read asap and edit the post again.