On May 3rd we deployed. Our 3-node Cassandra cluster became extremely slow and many web requests were timing out. By EOD May 3rd we added another m1.large machine to the cluster, which resolved the timeouts. That said, the cluster was still extremely slow, so on May 4th we launched five i3.xlarge nodes. This helped our application response time considerably, and on May 5th we removed the old m1.large boxes from the cluster. As of EOD May 5th everything was quick and responsive. This morning, the application began timing out again.
We have noticed some odd CPU utilization behaviour: CPU usage fluctuates between 100% and 200% regardless of load (they are four-core machines). Our weekends are extremely light, with essentially no load, and our Monday loads are relatively heavy, yet we see absolutely no change in CPU usage.
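For reference, this is roughly how we have been sampling the Cassandra process CPU to confirm it stays flat regardless of load. It is only a sketch; it assumes pidstat (from the sysstat package) is installed and that the daemon matches CassandraDaemon:
# sample the Cassandra process CPU once a minute
PID=$(pgrep -f CassandraDaemon)
pidstat -p "$PID" 60 >> cassandra_cpu.log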
As you can see in the two-week graph below, our database CPU usage was once bound to application usage. You can see the large spike on the 3rd, the introduction of the new machines on the 4th, and the stable high CPU usage starting on the 6th.
We have spent a good amount of time trying to track down the cause of the CPU usage and were able to identify (and subsequently rule out) three likely culprits:
- High khugepaged CPU usage
- Poorly tuned garbage collection
- Poorly tuned compactions
Here's how we ruled each of them out:
- Our servers have 0.0% khugepaged CPU usage.
- Our GC throughput is about 96%. We have also tweaked the heap and new-generation sizes and switched over to G1 GC. Our logs used to show warnings about long GC pauses but no longer do, and the GC threads account for only a small amount of CPU usage.
- nodetool compactionstats returns 0 pending tasks. We have switched over to LeveledCompactionStrategy and set gc_grace_seconds to 1 day. Our logs used to show warnings about large numbers of tombstones but no longer do, and nodetool compactionhistory shows about one compaction per hour, each of which completes extremely quickly according to the logs (< 1 second). (A sketch of the checks and settings we applied follows this list.)
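For completeness, here is roughly what those checks and changes look like on our side. This is only a sketch: the keyspace/table name and the heap numbers are placeholders, not our actual values.
# 1. khugepaged CPU usage (consistently 0.0 for us)
ps -C khugepaged -o pid,%cpu,comm

# 2. GC: heap sizing and the switch to G1, set in cassandra-env.sh
#    (placeholder values; the CMS-specific JVM_OPTS lines are commented out)
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"

# 3. Compaction: LCS plus a 1-day gc_grace_seconds (86400 s)
#    ("my_keyspace.my_table" is a placeholder name)
cqlsh -e "ALTER TABLE my_keyspace.my_table
          WITH compaction = {'class': 'LeveledCompactionStrategy'}
          AND gc_grace_seconds = 86400;"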
It appears that Cassandra's SharedPoolWorker threads have very high usage. Here's one node's CPU usage by type of thread (they all look pretty similar; the sketch after the listing shows how we gathered these numbers):
84.6 SharedPoolWorker
22.1 Thrift
13.5 CCompilerThread
11.9 MessagingServiceOutgoing
9.4 MessagingServiceIncoming
3.6 GangworkerParallelGCThreads
1.6 DestroyJavaVM
.3 VMThread
.1 Thread
.1 ScheduledTasks
.1 OptionalTasks
0 ...
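For what it's worth, the numbers above came from something along these lines: a crude grouping by thread name (ps truncates Java thread names to ~15 characters, so the buckets are approximate):
PID=$(pgrep -f CassandraDaemon)
ps -p "$PID" -Lo pcpu=,comm= \
  | awk '{
      cpu = $1
      sub(/^[[:space:]]*[0-9.]+[[:space:]]+/, "")   # strip the pcpu column, keep the name
      sub(/[-#:0-9]+[[:space:]]*$/, "")             # strip trailing per-thread numbering
      total[$0] += cpu
    }
    END { for (name in total) printf "%.1f %s\n", total[name], name }' \
  | sort -rn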
Checking the state of the SharedPoolWorker threads shows that the vast majority are in WAITING with the following stack trace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:85)
at java.lang.Thread.run(Unknown Source)
I think this is the problem, but I am not sure why, since very little CPU time is spent waiting (consistently 0% according to dstat).
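For the record, the check is nothing fancier than the following; any equivalent tool would show the same thing:
# watch the CPU breakdown (usr/sys/idl/wai) in 5-second intervals;
# the "wai" column stays at 0 for us
dstat -c 5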
Now, interestingly enough, running nodetool tpstats on any given node shows a small number of ReadStage threads active, and occasionally one or two pending. None are blocked, all-time blocked, or dropped.
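We have been sampling that output repeatedly rather than trusting a single snapshot, roughly like this:
# re-check the thread pools every few seconds to catch transient pending reads
watch -n 5 nodetool tpstats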
Here's the output of nodetool cfstats, and here's nodetool netstats:
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 12229
Mismatch (Blocking): 2
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Commands                        n/a         0         707576          0
Responses                       n/a         0         859216        n/a
Does anyone have any ideas about why this might be occurring? Is there anything else we should look into?