Datastax agent failing to report metrics once in a while

Question

I'm running a DSE 4.6.5 cluster (Cassandra 2.0.14.352) with OpsCenter 5.1.1.

Once or twice a day, one of the nodes (sometimes more) stops reporting metrics until I manually restart the datastax-agent.

The agent process is still alive when this happens. Here's the agent log:

WARN [Thread-13] 2015-04-14 23:20:23,277 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,277 131176 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,277 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,277 131177 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,278 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,278 131178 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,278 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,278 131179 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,278 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,278 131180 operations dropped so far.
ERROR [cassandra-processor-1] 2015-04-14 23:20:24,387 Error when proccessing cassandra call
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)

Please note that:

All the nodes are in the same datacenter, with the same hardware specs and the same configuration.
Each node has two NICs, so rpc_address and listen_address are on different networks.
OpsCenter is running on one of the cluster nodes.
The write load is heavy (see my other question).

To sum up: on one machine at a time (in a round-robin fashion), the agent stops reporting data while the agents on the other machines keep working. Restarting the agent service fixes the issue, but shouldn't the agent recover on its own? Is this a bug? How can I work around it?

Please tell me if you need more information. Thanks.

score 1 · Accepted Answer · answered May 11 '15 at 01:00

I've seen this same thing. Two things you can try.

1) Exclude or limit the keyspaces/column families you collect metrics from. http://docs.datastax.com/en/opscenter/5.1/opsc/configure/opscControllingDataCollection_c.html?scroll=concept_ds_jlq_xk4_gk

2) Store OpsCenter's data on a separate cluster (for example, a small one- or two-node cluster apart from your main cluster). http://www.datastax.com/dev/blog/storing-opscenter-data-in-a-separate-cluster

Honestly, option 2 is the smarter move. You don't need large nodes, and if you store metrics on your main cluster and that cluster crashes, you're running blind.
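
For reference, here is a rough sketch of what the two options above might look like in the OpsCenter cluster configuration file. The section and option names are as I recall them from the OpsCenter 5.x documentation, so verify them against the linked pages before relying on this. The file path, keyspace and column family names, host names, and port are placeholder assumptions, not values from your setup.

```ini
# Sketch only. On package installs the file is usually
# /etc/opscenter/clusters/<cluster_name>.conf (assumed path).

# Option 1: stop collecting metrics for selected keyspaces / column families.
# "my_keyspace.my_busy_cf" is a hypothetical name.
[cassandra_metrics]
ignored_keyspaces = system, OpsCenter
ignored_column_families = my_keyspace.my_busy_cf

# Option 2: store OpsCenter data in a separate, small cluster.
# Host names and the keyspace name are placeholders.
[storage_cassandra]
seed_hosts = storage-host1, storage-host2
api_port = 9160
keyspace = OpsCenter_MainCluster
```

opscenterd most likely needs a restart before it picks up changes to this file.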
