I'm running a DSE 4.6.5 cluster (Cassandra 2.0.14.352) with OpsCenter 5.1.1.
Once or twice a day, one of the nodes (sometimes more) stops reporting metrics until I manually restart the datastax-agent.
The agent process is still running when this happens; it just no longer reports anything. Here's the agent log:
WARN [Thread-13] 2015-04-14 23:20:23,277 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,277 131176 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,277 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,277 131177 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,278 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,278 131178 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,278 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,278 131179 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,278 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,278 131180 operations dropped so far.
ERROR [cassandra-processor-1] 2015-04-14 23:20:24,387 Error when proccessing cassandra call
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
Please note that:
- All the nodes are in the same datacenter, with the same hardware specs and the same configuration.
- Each node has two NICs, so rpc_address and listen_address are on different networks.
- OpsCenter is running on one of the cluster nodes.
- The write load is intensive (see my other question).
To sum up: on one machine at a time (in a round-robin fashion), the agent stops reporting data, while on the others it works fine. Restarting the agent service fixes the issue, but shouldn't the agent recover by itself? Is this a bug? How can I work around it?
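As a stopgap, the symptom could be detected from the agent log and the restart automated. Below is a minimal sketch in Python, not a fix for the underlying problem. It only relies on the "operations dropped so far" message format shown in the log above; the function names and the threshold of 1000 are hypothetical choices for illustration.

```python
import re

# Matches the agent's "N operations dropped so far." warning, as seen in the log above.
DROPPED_RE = re.compile(r"(\d+) operations dropped so far\.")


def latest_dropped_count(lines):
    """Return the most recent dropped-operations count found in the lines, or None."""
    count = None
    for line in lines:
        match = DROPPED_RE.search(line)
        if match:
            count = int(match.group(1))
    return count


def should_restart(lines, threshold=1000):
    """Return True when the latest dropped count has reached the threshold.

    The threshold is a hypothetical value; tune it to the cluster.
    """
    count = latest_dropped_count(lines)
    return count is not None and count >= threshold
```

A cron job could feed the tail of the agent log to `should_restart` and restart the datastax-agent service when it returns True. The log location and the exact restart command depend on how the agent was installed, so they are left out here.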
Please tell me if you need more information. Thanks.