
I upgraded from Opscenter 5 to the latest DSE Opscenter 6.0 via apt on Ubuntu 14.10 LTS. After clean installs, 3-4 agents in a 3-DC/6-node DSE 4.8 cluster are reporting issues, as shown.

[Screenshot: Opscenter Agent status page]

The agent logs on the nodes show issues as below:

 INFO [async-dispatch-22] 2016-07-13 22:38:46,208 Starting monitored database connection.
 ERROR [async-dispatch-22] 2016-07-13 22:39:00,566 Can't connect to Cassandra (All host(s) tried for query failed (tried: /172.30.0.217:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [/172.30.0.217] Operation timed out))), retrying soon.
  INFO [async-dispatch-22] 2016-07-13 22:39:00,570 Starting JMXComponent
 ERROR [async-dispatch-22] 2016-07-13 22:40:00,631 Error starting JMXComponent
 java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
        java.net.SocketTimeoutException: Read timed out]
        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:369)
        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:249)
        at opsagent.jmx$create_jmx_pool_with_config$wrapper__11498.doInvoke(jmx.clj:216)
        at clojure.lang.RestFn.invoke(RestFn.java:439)
        at opsagent.jmx.JMXComponent.start(jmx.clj:320)
        at com.stuartsierra.component$fn__8837$G__8831__8839.invoke(component.clj:4)
        at com.stuartsierra.component$fn__8837$G__8830__8842.invoke(component.clj:4)
        ........

OR

INFO [async-dispatch-22] 2016-07-13 22:40:00,635 Starting JMXComponent
 ERROR [async-dispatch-22] 2016-07-13 22:41:00,676 Error starting JMXComponent
 java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
        java.net.SocketTimeoutException: Read timed out]
        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:369)
        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:249)
        at opsagent.jmx$create_jmx_pool_with_config$wrapper__11498.doInvoke(jmx.clj:216)
        at clojure.lang.RestFn.invoke(RestFn.java:439)
        at opsagent.jmx.JMXComponent.start(jmx.clj:320)
        at com.stuartsierra.component$fn__8837$G__8831__8839.invoke(component.clj:4)
        at com.stuartsierra.component$fn__8837$G__8830__8842.invoke(component.clj:4)
        at clojure.lang.Var.invoke(Var.java:379)
        at clojure.lang.AFn.applyToHelper(AFn.java:154)

This cluster ran the previous Opscenter with the same DSE 4.8 setup for 2 years without any issues. Any help with troubleshooting this is highly appreciated.

c360ian
  • Is port 8888 open between the nodes and your opscenter machine? Please test from the node to the opscenter machine `telnet 8888` – phact Jul 14 '16 at 00:13
  • Yes, 8888 port on the Opscenter node is open for all and is the port used to access the UI. – c360ian Jul 14 '16 at 00:20
  • The error is the opscenter agent being unable to connect to Cassandra via JMX. By default it's connecting to localhost:7199; can you post the output of `netstat -an | grep 7199` to see if C* is listening? Your cassandra-env.sh would be helpful to look at as well (in case rmi.hostname is changed). – Chris Lohfink Jul 14 '16 at 00:45
  • Thanks for looking into this. Here is the output: `$ sudo netstat -apn | grep 7199 tcp 0 0 127.0.0.1:7199 0.0.0.0:* LISTEN 28408/java tcp 0 0 127.0.0.1:7199 127.0.0.1:40133 ESTABLISHED 28408/java tcp6 0 0 127.0.0.1:40218 127.0.0.1:7199 TIME_WAIT - tcp6 0 0 127.0.0.1:40133 127.0.0.1:7199 ESTABLISHED 19508/java ` – c360ian Jul 14 '16 at 01:01
  • So the port seems to be open, but this happens when there's a read timeout (`sun.rmi.transport.tcp.handshakeTimeout`) occurring on the connect between agent and Cassandra. Is there a chance the RMI hostname is something that isn't resolvable? It may be worth uncommenting the `JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname="` line in `cassandra-env.sh`, setting it to `localhost` or `127.0.0.1`, and then restarting Cassandra. – Chris Lohfink Jul 14 '16 at 03:10
  • The reason I was thinking 8888 is that most of the HTTP statuses in that screenshot seem to be showing as down, whereas there's only 1 box with a JMX and storage issue. That one box might actually be down (the dse process is stopped). For the HTTP error, if 8888 is open, try just restarting the agent and opscenterd. – phact Jul 14 '16 at 05:03
  • @ChrisLohfink The hostname doesn't seem to be the issue, since the connections stay up for some time before timing out. Whether it's HTTP or JMX, the connections are not staying up. They all fail unpredictably on one or more machines within a 2-hour period, prompting a reinstall of the agent again. – c360ian Jul 15 '16 at 01:11
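
The port checks discussed in the comments (`telnet`, `netstat`) can be repeated from a script so the same test runs on every node and can be re-run while the failures come and go. The following is a minimal sketch, not an official tool: `check_port` and the target list are hypothetical names chosen for illustration, and the hosts are assumptions (localhost with the default JMX port 7199 and the native port 9042 seen in the logs). A successful TCP connect only shows that something is listening; it does not prove that the RMI handshake or a CQL query would complete, so treat it as a first step.

```python
import socket
import time


def check_port(host, port, timeout=5.0):
    """Attempt a plain TCP connect.

    Returns (ok, seconds_elapsed, error_text). error_text is empty on success.
    """
    start = time.monotonic()
    try:
        # create_connection resolves the host and applies the timeout to connect()
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers refused connections, timeouts, DNS failures
        return False, time.monotonic() - start, str(exc)


if __name__ == "__main__":
    # Assumed targets: adjust hosts to match the node being checked.
    targets = [
        ("127.0.0.1", 7199),  # JMX, as bound in the netstat output above
        ("127.0.0.1", 9042),  # native transport port from the agent log
    ]
    for host, port in targets:
        ok, elapsed, err = check_port(host, port, timeout=3.0)
        status = "open" if ok else "FAILED (" + err + ")"
        print("{}:{} {} after {:.3f}s".format(host, port, status, elapsed))
```

Running this from cron on each node and keeping the output would show whether the ports themselves stop accepting connections during the 2-hour window, or whether the connects succeed while the agent still times out.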

0 Answers