I'm running a DSE 4.6.5 cluster (Cassandra 2.0.14.352) with OpsCenter 5.1.1.

Once or twice a day, one of the nodes (sometimes more) stops reporting metrics until I manually restart the datastax-agent.

The agent process is still alive before I restart it. Here is the agent log:

WARN [Thread-13] 2015-04-14 23:20:23,277 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,277 131176 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,277 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,277 131177 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,278 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,278 131178 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,278 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,278 131179 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,278 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,278 131180 operations dropped so far.
ERROR [cassandra-processor-1] 2015-04-14 23:20:24,387 Error when proccessing cassandra call
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)

Please note that:

  • All the nodes are in the same datacenter, with the same hardware specs and the same configuration.
  • Each node uses two NICs, so rpc_address and listen_address are on different networks.
  • OpsCenter is running on one of the cluster nodes.
  • The cluster is write-intensive (see my other question).

To sum up, the agent on one of the machines (the affected machine changes in a round-robin fashion) stops reporting data, while the agents on the other machines keep working. Restarting the agent service fixes the issue, but shouldn't the agent recover by itself? Is this a bug? How can I work around it?

Please tell me if you need more information. Thanks.

Community

1 Answer

I've seen the same thing. There are two things you can try.

1) Exclude or limit the keyspaces and column families you collect metrics from: http://docs.datastax.com/en/opscenter/5.1/opsc/configure/opscControllingDataCollection_c.html?scroll=concept_ds_jlq_xk4_gk

2) Store the OpsCenter data on a separate cluster (for example, a small one- or two-node cluster apart from your main cluster): http://www.datastax.com/dev/blog/storing-opscenter-data-in-a-separate-cluster

Option 2 is honestly the smarter move. You don't need large nodes for it, and if you store metrics on your main cluster and that cluster crashes, you're running blind.
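
As a rough sketch of what the two options look like in practice: both are set in the per-cluster OpsCenter configuration file (for package installs this is typically /etc/opscenter/clusters/<cluster_name>.conf). The section and option names below are as I remember them from the OpsCenter 5.1 docs, so please check them against the two links above. The host addresses, keyspace names, and table names are made-up placeholders.

```ini
# Option 1: collect fewer metrics.
# Assumption: setting ignored_keyspaces replaces the default list, so the
# system keyspaces are listed again here next to the hypothetical "my_bulk_ks".
[cassandra_metrics]
ignored_keyspaces = system, system_traces, system_auth, dse_auth, OpsCenter, my_bulk_ks
# Hypothetical keyspace.table pairs to skip individually.
ignored_column_families = my_ks.wide_events, my_ks.raw_log

# Option 2: store OpsCenter data on a separate, small cluster.
# The seed hosts are placeholder addresses of the storage cluster nodes.
[storage_cassandra]
seed_hosts = 10.0.0.11, 10.0.0.12
api_port = 9160
# Placeholder keyspace name for this monitored cluster's metrics.
keyspace = OpsCenter_MainCluster
```

After editing the file, restart opscenterd so the change is picked up. You would normally start with only one of the two sections and see whether the "operation queue is full" warnings stop.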

petecheslock