
We have a 6-node Cassandra cluster under heavy utilization. We have been dealing a lot with garbage collector stop-the-world events, which can take up to 50 seconds on our nodes; in the meantime the Cassandra node is unresponsive, not even accepting new logins.

Extra details:

  • Cassandra Version: 3.11
  • Heap Size = 12 GB
  • We are using the G1 garbage collector with default settings
  • Node size: 4 CPUs, 28 GB RAM
  • The G1 GC behavior is the same across all nodes.

Any help would be very much appreciated!
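
In case it is relevant, the GC flags actually in effect on a node can be double-checked against the running JVM along these lines (the pgrep pattern is just an assumed way to find the Cassandra PID):

jcmd $(pgrep -f CassandraDaemon) VM.flags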

[Screenshots: GC pause and heap usage monitoring charts from the affected nodes]


Edit 1:

Checking the object creation stats, they do not look healthy at all.

[Screenshot: object allocation rate chart]


Edit 2:

I have tried the settings suggested by Chris Lohfink; here are the GC reports:

Using the suggested CMS settings: http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTAtNDk=

Using the suggested G1 settings: http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTExLTE3

The behavior remains basically the same:

  1. Old Gen starts to fill up.
  2. GC can't clean it properly without a full GC and a STW event.
  3. The full GCs start to take longer, until the node is completely unresponsive (the corresponding entries show up directly in the GC log; see the grep below).
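
A quick way to spot those events in the raw log, before loading it into gceasy, is something like this (the log path is an assumption; use whatever -Xloggc points at on your nodes):

grep -E "to-space exhausted|Full GC" /var/log/cassandra/gc.log.0.current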

I'm going to get the cfstats output for maximum partition size and tombstones per read asap and edit the post again.
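
For reference, those numbers can be pulled per table with nodetool (the keyspace/table names below are placeholders; the maximum partition size and tombstones-per-slice lines in the cfstats output are the ones of interest):

nodetool cfstats my_keyspace.my_table
nodetool tablehistograms my_keyspace my_table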

Scudeler
  • heap after GC is increasing, so either your application simply needs more memory, you have a leak, or Cassandra is configured in a way that bursts allocations in a manner that G1 can't keep up with. Those cases can't be distinguished from those charts alone. – the8472 Oct 04 '17 at 17:56
  • What are your current GC settings? – Chris Lohfink Oct 04 '17 at 19:28
  • Can you include your cfstats output for maximum partition size and tombstones per read? Scanning over tombstones and deserializing a large partition index are common causes of high object allocation rates. Also, as per the above comment, it's hard to tell how to improve your GC without knowing your current settings. – Chris Lohfink Oct 06 '17 at 16:46
  • @ChrisLohfink I was using the default G1 GC settings; I did play around with -XX:MaxGCPauseMillis, but nothing changed. I edited the post with GC reports using the recommended settings for G1 and CMS, and I'm going to get the info you asked for ASAP. – Scudeler Oct 08 '17 at 19:20
  • @the8472 It looks like a memory leak (https://plumbr.eu/blog/memory-leaks/memory-leaks-fallacies-and-misconceptions). Could you give me an example of which Cassandra settings I could check? – Scudeler Oct 08 '17 at 19:24
  • Can you share your schema and cfstats? This looks kinda like a wide partition or tombstone issue for it to create so much garbage. A heap dump would be most telling but they are rather large and hard to share. – Chris Lohfink Oct 09 '17 at 14:57

2 Answers


Have you looked at using Zing? Cassandra situations like these are a classic use case, as Zing fundamentally eliminates all GC-related glitches in Cassandra nodes and clusters.

You can see some details on the how/why in my recent "Understanding GC" talk from JavaOne (https://www.slideshare.net/howarddgreen/understanding-gc-javaone-2017). Or just skip to slides 56-60 for Cassandra-specific results.

Gil Tene

Without knowing your existing settings or possible data model problems, here's a guess at some conservative settings to try in order to reduce evacuation pauses from not having enough to-space (check the GC logs):

-Xmx12G -Xms12G -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=500 -XX:-ReduceInitialCardMarks -XX:G1HeapRegionSize=32m

This should also help reduce the update remembered set pause, which becomes an issue, and setting G1HeapRegionSize reduces humongous object allocations, which can become a problem depending on the data model. Make sure -Xmn is not set.
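
In Cassandra 3.11 these flags would normally live in conf/jvm.options (one option per line), with the default CMS entries commented out; the layout below is only a sketch, and the heap size lines are included for completeness:

# conf/jvm.options (sketch)
-Xms12G
-Xmx12G
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:-ReduceInitialCardMarks
-XX:G1HeapRegionSize=32m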

For what it's worth, a 12 GB heap with C* is probably better suited to CMS; you can certainly get better throughput. You just need to be careful of fragmentation over time with the rather large objects that can get allocated.

-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=55 -XX:MaxTenuringThreshold=3 -Xmx12G -Xms12G -Xmn3G -XX:+CMSEdenChunksRecordAlways -XX:+CMSParallelInitialMarkEnabled -XX:+CMSParallelRemarkEnabled -XX:CMSWaitDuration=10000 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCondCardMark 

Most likely, though, there's an issue with the data model or you're under-provisioned.

Chris Lohfink