This is a situation for the company that I work for, that's causing to loose my sleep, since I'm in front of the wall and finding no options. I have a production cluster of 51 cassandra 3.11.9 nodes, that works with a large load (from 600 to 800 GB each node), that is presenting a very annoying issue: For an unknown reason, there machines start doing GC pauses, that really hurt the response time of my application, since the consistency level on clients is ONE,creating peaks of latency.
Here are some samples of:
28 | CHANGED | rc=0 >>
INFO [Service Thread] 2023-07-14 01:56:45,583 GCInspector.java:285 - G1 Old Generation GC in 12039ms. G1 Old Gen: 54648149088 -> 12146746552; G1 Survivor Space: 134217728 -> 0;
INFO [Service Thread] 2023-07-14 02:14:24,394 GCInspector.java:285 - G1 Old Generation GC in 57918ms. G1 Old Gen: 67780410096 -> 59704216816; Metaspace: 61436792 -> 61302392
INFO [Service Thread] 2023-07-14 02:15:44,506 GCInspector.java:285 - G1 Old Generation GC in 64576ms. G1 Old Gen: 67971190408 -> 64736391536;
INFO [Service Thread] 2023-07-14 02:17:06,520 GCInspector.java:285 - G1 Old Generation GC in 66242ms. G1 Old Gen: 68043573704 -> 66792790424;
INFO [Service Thread] 2023-07-14 02:21:31,210 GCInspector.java:285 - G1 Old Generation GC in 257268ms. G1 Old Gen: 68046631936 -> 67703054448;
254 | CHANGED | rc=0 >>
INFO [Service Thread] 2023-07-14 02:36:26,170 GCInspector.java:285 - G1 Old Generation GC in 46654ms. G1 Old Gen: 133621345752 -> 49403423024; Metaspace: 67436096 -> 67339688
INFO [Service Thread] 2023-07-14 02:38:58,627 GCInspector.java:285 - G1 Old Generation GC in 89392ms. G1 Old Gen: 133594285096 -> 103157948104;
INFO [Service Thread] 2023-07-14 02:40:59,754 GCInspector.java:285 - G1 Old Generation GC in 93345ms. G1 Old Gen: 135071359720 -> 105377369048; G1 Survivor Space: 33554432 -> 0;
INFO [Service Thread] 2023-07-14 02:43:29,171 GCInspector.java:285 - G1 Old Generation GC in 106174ms. G1 Old Gen: 133812654600 -> 119264140552; G1 Survivor Space: 234881024 -> 0;
INFO [Service Thread] 2023-07-14 02:45:36,900 GCInspector.java:285 - G1 Old Generation GC in 95625ms. G1 Old Gen: 135225564784 -> 99943593104;
INFO [Service Thread] 2023-07-14 02:46:53,820 GCInspector.java:285 - G1 Old Generation GC in 55875ms. G1 Old Gen: 133359614104 -> 60924511688; G1 Survivor Space: 872415232 -> 0;
INFO [Service Thread] 2023-07-14 02:48:22,803 GCInspector.java:285 - G1 Old Generation GC in 38493ms. G1 Old Gen: 133978126912 -> 36277631424;
INFO [Service Thread] 2023-07-14 02:50:11,320 GCInspector.java:285 - G1 Old Generation GC in 34789ms. G1 Old Gen: 134004918888 -> 35377344368;
250 | CHANGED | rc=0 >>
INFO [Service Thread] 2023-07-14 00:18:52,262 GCInspector.java:285 - G1 Old Generation GC in 96017ms. G1 Old Gen: 73628910144 -> 59159105432; Metaspace: 58018496 -> 57907432
INFO [Service Thread] 2023-07-14 00:46:41,400 GCInspector.java:285 - G1 Old Generation GC in 30177ms. G1 Old Gen: 41448088568 -> 24094354384; G1 Survivor Space: 67108864 -> 0;
INFO [Service Thread] 2023-07-14 02:18:34,910 GCInspector.java:285 - G1 Old Generation GC in 40940ms. G1 Old Gen: 74016882928 -> 27759131352; Metaspace: 57315192 -> 57128720
INFO [Service Thread] 2023-07-14 02:36:02,256 GCInspector.java:285 - G1 Old Generation GC in 57658ms. G1 Old Gen: 73488401080 -> 40838191112; Metaspace: 54701984 -> 54651552
INFO [Service Thread] 2023-07-14 02:37:47,374 GCInspector.java:285 - G1 Old Generation GC in 87036ms. G1 Old Gen: 73498188264 -> 65920831896;
INFO [Service Thread] 2023-07-14 02:39:58,921 GCInspector.java:285 - G1 Old Generation GC in 111435ms. G1 Old Gen: 73496794000 -> 70079092144;
In the latest months I've tried several things, like:
- Increasing the instance type (increasing the JVM), but the error just get more time to happens, but happens anyway.
- Removing nodes that presenting this issue, but new nodes started having the problem
- Using G1GC and different JVM vendors, like the one from AZUL
Currently, I don't have the following options:
- Changing the wiring or the data modeling, since that's depends on other teams.
- Updating Cassandra to 4 version, the application needs to be updated first.
Right now, the only resource that I have is using "disablebinary" on these nodes for avoiding peaks, but that it's not good.
The JVM
INFO [main] 2023-07-17 18:40:11,668 CassandraDaemon.java:507 - JVM Arguments: [-javaagent:/opt/simility/include/exporters/jmxexporter/jmx_prometheus_javaagent-0.12.0.jar=7070:/opt/simility/include/exporters/jmxexporter/cassandra.yml, -ea, -javaagent:/opt/simility/include/cassandra/lib/jamm-0.3.0.jar, -XX:+UseThreadPriorities, -XX:ThreadPriorityPolicy=42, -Xms64G, -Xmx64G, -XX:+ExitOnOutOfMemoryError, -Xss256k, -XX:StringTableSize=1000003, -XX:+UseG1GC, -XX:G1RSetUpdatingPauseTimePercent=5, -XX:MaxGCPauseMillis=500, -Djava.net.preferIPv4Stack=true, -Dcassandra.config=file:///opt/simility/conf/cassandra/cassandra.yaml, -Djava.rmi.server.hostname=172.33.135.28, -Dcom.sun.management.jmxremote.port=7199, -Dcom.sun.management.jmxremote.rmi.port=7199, -Dcom.sun.management.jmxremote.ssl=false, -Dcom.sun.management.jmxremote.authenticate=false, -Dcassandra.libjemalloc=/lib64/libjemalloc.so.1, -Dlogback.configurationFile=logback.xml, -Dcassandra.config=file:///opt/simility/conf/cassandra/cassandra.yaml, -Dcassandra.logdir=/opt/simility/log/cassandra, -Dcassandra.storagedir=/opt/simility/include/cassandra/data]
Thank you!
several things, as explained in the last message