
In our production cluster, the cluster write latency frequently spikes from 7 ms to 4 s. Because of this, clients see a lot of read and write timeouts. This repeats every few hours.

Observation:

- Cluster write latency (99th percentile): 4 s
- Local write latency (99th percentile): 10 ms
- Read & write consistency: LOCAL_ONE
- Total nodes: 7

I enabled tracing with nodetool settraceprobability for a few minutes and observed that most of the time is spent in internode communication:
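
(For reference, tracing was turned on per node roughly like this; the probability value is just an example, not necessarily the one we used:)

nodetool settraceprobability 0.001   # trace roughly 0.1% of requests on this node
nodetool settraceprobability 0       # turn tracing back off afterwards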

 session_id                           | event_id                             | activity                                                                                                                    | source        | source_elapsed | thread
--------------------------------------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------------+---------------+----------------+------------------------------------------
 4267dca2-bb79-11e8-aeca-439c84a4762c | 429c3314-bb79-11e8-aeca-439c84a4762c | Parsing  SELECT * FROM table1 WHERE uaid = '506a5f3b' AND messageid >= '01;'  | cassandranode3 |              7 |                     SharedPool-Worker-47
 4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a20-bb79-11e8-aeca-439c84a4762c |                                                                                                         Preparing statement | Cassandranode3 |             47 |                     SharedPool-Worker-47
 4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a21-bb79-11e8-aeca-439c84a4762c |                                                                                            reading data from /Cassandranode1 | Cassandranode3 |            121 |                     SharedPool-Worker-47
 4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38610-bb79-11e8-aeca-439c84a4762c |                                                                       REQUEST_RESPONSE message received from /cassandranode1 | cassandranode3 |          40614 | MessagingService-Incoming-/Cassandranode1
 4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38611-bb79-11e8-aeca-439c84a4762c |                                                                                     Processing response from /Cassandranode1 | Cassandranode3 |          40626 |                      SharedPool-Worker-5

I checked the connectivity between the Cassandra nodes but did not see any issues. The Cassandra logs are flooded with read timeout exceptions, as this is a pretty busy cluster with 30k reads/sec and 10k writes/sec.

Warning in the system.log:

WARN  [SharedPool-Worker-28] 2018-09-19 01:39:16,999 SliceQueryFilter.java:320 - Read 122 live and 266 tombstone cells in system.schema_columns for key: system (see tombstone_warn_threshold). 2147483593 columns were requested, slices=[-]

During the spike the cluster just stalls, and even simple commands like "use system_traces" fail.

cassandra@cqlsh:system_traces> select * from sessions ;
Warning: schema version mismatch detected, which might be caused by DOWN nodes; if this is not the case, check the schema versions of your nodes in system.local and system.peers.
Schema metadata was not refreshed. See log for details.

I validated the schema versions on all nodes and they are the same, but it looks like during the issue window Cassandra is not even able to read the metadata.
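
(For reference, the per-node validation was along these lines; nodetool summarizes schema agreement across the cluster:)

# A healthy cluster reports a single schema version with every node listed under it.
nodetool describecluster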

Has anyone faced similar issues? Any suggestions?

    source_elapsed is in micros, not millis. That trace is 40ms – Chris Lohfink Sep 19 '18 at 01:18
  • a 4 second delay is most likely from GCs, the really bad GCs (over 1 second) are usually from too small heaps, misconfigured heaps, or bad data models. – Chris Lohfink Sep 19 '18 at 01:20
  • @ChrisLohfink - these are bare-metal nodes with 48 cores and 192 GB RAM. We use G1GC with the following config: MAX_HEAP_SIZE="48G" HEAP_NEWSIZE="6G" JVM_OPTS="$JVM_OPTS -XX:+CMSClassUnloadingEnabled" JVM_OPTS="$JVM_OPTS -XX:+UseThreadPriorities" JVM_OPTS="$JVM_OPTS -XX:ThreadPriorityPolicy=42" JVM_OPTS="$JVM_OPTS -Xms${MAX_HEAP_SIZE}" JVM_OPTS="$JVM_OPTS -Xmx${MAX_HEAP_SIZE}" JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError" JVM_OPTS="$JVM_OPTS -Xss256k" JVM_OPTS="$JVM_OPTS -XX:StringTableSize=1000003" – Sinister8774 Sep 19 '18 at 02:53
  • JVM_OPTS="$JVM_OPTS -XX:+UseG1GC" JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5" JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=200" JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=32" JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=10" JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=50" – Sinister8774 Sep 19 '18 at 02:56
  • I can see a few instances of GC taking a lot of time, ~20s: {Heap before GC invocations=51501 (full 648): garbage-first heap total 50331648K, used 29103665K [0x00002b54fa000000, 0x00002b54fb006000, 0x00002b60fa000000) region size 16384K, 1420 young (23265280K), 55 survivors (901120K) Metaspace used 36912K, capacity 37584K, committed 37632K, reserved 38912K 2018-09-19T04:35:42.275+0530: 546488.700: [Full GC (System.gc()) 27G->4279M(48G), 20.1535506 secs] – Sinister8774 Sep 19 '18 at 03:03
  • [Eden: 21.3G(24.5G)->0.0B(28.8G) Survivors: 880.0M->0.0B Heap: 27.8G(48.0G)->4279.4M(48.0G)], [Metaspace: 36912K->36912K(38912K)] Heap after GC invocations=51502 (full 649): garbage-first heap total 50331648K, used 4382114K [0x00002b54fa000000, 0x00002b54fb006000, 0x00002b60fa000000) region size 16384K, 0 young (0K), 0 survivors (0K) Metaspace used 36912K, capacity 37584K, committed 37632K, reserved 38912K } [Times: user=25.41 sys=0.00, real=20.16 secs] 2018-09-19T04:36:02.429+0530: 546508.854: Total time for which application threads were stopped: 20.1728969 seconds – Sinister8774 Sep 19 '18 at 03:04
  • @ChrisLohfink - do you have any suggestions to tune GC? – Sinister8774 Sep 19 '18 at 03:04

1 Answer


(From the data in your comments above.) The long full GC pauses can definitely cause this. Add -XX:+DisableExplicitGC: you are getting full GCs because of calls to System.gc(), which most likely come from the JVM's RMI distributed GC (DGC), which runs at regular intervals regardless of whether a collection is needed. With a heap this large that is VERY expensive. It is safe to disable.
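
For example, in cassandra-env.sh, following the same JVM_OPTS pattern as your config above (the comment is only a sketch of the reasoning, not part of the stock file):

# Ignore explicit System.gc() calls (e.g. from RMI DGC) so they no longer
# trigger stop-the-world full collections on the large G1 heap.
JVM_OPTS="$JVM_OPTS -XX:+DisableExplicitGC"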

Check your GC log header and make sure the min heap size is not set. I would also recommend setting -XX:G1ReservePercent=20.
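
For instance (the gc.log path is an assumption; use whatever your -Xloggc option points at):

# The GC log header echoes the effective flags; if InitialHeapSize equals
# MaxHeapSize, the minimum heap has effectively been pinned via -Xms.
grep "CommandLine flags" /var/log/cassandra/gc.log

# Reserve ~20% of the heap as evacuation headroom for G1.
JVM_OPTS="$JVM_OPTS -XX:G1ReservePercent=20"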

  • My current config: CommandLine flags: -XX:CICompilerCount=15 -XX:+CMSClassUnloadingEnabled -XX:+CMSEdenChunksRecordAlways -XX:+CMSParallelInitialMarkEnabled -XX:CMSWaitDuration=10000 -XX:ConcGCThreads=10 -XX:G1HeapRegionSize=16777216 -XX:G1RSetUpdatingPauseTimePercent=5 -XX:GCLogFileSize=10485760 -XX:+HeapDumpOnOutOfMemoryError -XX:InitialHeapSize=51539607552 -XX:InitiatingHeapOccupancyPercent=50 -XX:+ManagementServer -XX:MarkStackSize=4194304 -XX:MaxGCPauseMillis=200 -XX:MaxHeapSize=51539607552 -XX:MaxNewSize=30920409088 -XX:MinHeapDeltaBytes=16777216 – Sinister8774 Sep 19 '18 at 12:16
  • -XX:NumberOfGCLogFiles=10 -XX:ParallelGCThreads=32 -XX:PrintFLSStatistics=1 -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintPromotionFailure -XX:+PrintTenuringDistribution -XX:StringTableSize=1000003 -XX:ThreadPriorityPolicy=42 -XX:ThreadStackSize=256 -XX:+UseCondCardMark -XX:+UseFastUnorderedTimeStamps -XX:+UseG1GC -XX:+UseGCLogFileRotation -XX:+UseThreadPriorities – Sinister8774 Sep 19 '18 at 12:17