
When I try to decommission a node in my Cassandra cluster, the process starts: I see active streams flowing from the node being decommissioned to the other nodes in the cluster (which uses vnodes). But after a short delay, nodetool decommission exits with the error message below.

I can run nodetool decommission repeatedly, and each time it starts streaming data to other nodes, but so far it has always exited with the error below.

Why am I seeing this, and is there a way I can safely decommission this node?

Exception in thread "main" java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.db.HintedHandOffManager.getHintsSlice(HintedHandOffManager.java:578)
        at org.apache.cassandra.db.HintedHandOffManager.listEndpointsPendingHints(HintedHandOffManager.java:528)
        at org.apache.cassandra.service.StorageService.streamHints(StorageService.java:2854)
        at org.apache.cassandra.service.StorageService.unbootstrap(StorageService.java:2834)
        at org.apache.cassandra.service.StorageService.decommission(StorageService.java:2795)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
        at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
        at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
        at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
        at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
        at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
        at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1454)
        at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:74)
        at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1295)
        at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1387)
        at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:818)
        at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)
        at sun.rmi.transport.Transport$1.run(Transport.java:159)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:100)
        at org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:1213)
        at org.apache.cassandra.db.HintedHandOffManager.getHintsSlice(HintedHandOffManager.java:573)
        ... 33 more
Gordon Seidoh Worley
  • To anyone finding this later, I was able to decommission the node safely, I think, by watching the streams with `nodetool netstats` and simply waiting until everything had finished streaming off the node. Then I took it down and ran a repair across the cluster (a rough command sketch of this workaround follows these comments). – Gordon Seidoh Worley Mar 02 '14 at 21:11
  • Were you able to get it done any other way? If you have a very large cluster and repairs take months to complete, then doing this to one node and running repairs around the cluster isn't feasible when you have 25+ nodes to replace. – Eric Lubow Apr 10 '14 at 02:10
  • I actually ended up addressing the issue by using larger nodes. The problem was fundamentally that the nodes didn't have the resources to run Cassandra reliably. I was using m1.xlarge on aws at the time of this issue. Moving to i2.2xlarge alleviated the problem. There seems to be some real limits on how small a node can be and still function when you throw a lot of data at it. – Gordon Seidoh Worley Apr 10 '14 at 20:09
  • That's actually exactly what we are trying to do is move our entire cluster from m1's to i2's. The problem is that we can't decom the nodes to replace them. – Eric Lubow Apr 11 '14 at 16:59
  • You should know about the data center trick then: you can bring up a second data center with the nodes you want, then switch off the original data center to replace the cluster. If you already use data centers you can do this by replacing each data center in turn. See http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_add_dc_to_cluster_t.html for details on how to do this. – Gordon Seidoh Worley Apr 11 '14 at 19:50
  • That is a lot of machines to spin up and then wait for a lot of terabytes of data to transfer and then spin down an old datacenter. This also still leaves the problem of having to just shut down the old DC because decoms don't work. This is a hack at best assuming you have a small cluster. – Eric Lubow Apr 13 '14 at 14:24
  • It shouldn't be much of an issue: you can run the rebuild all at once across the entire new data center, and in my experience it does a good job of parallelizing the work, so you are only faced with the inter-machine transfer times, which within AWS are acceptable. You then don't need to decommission the old data center: after pointing clients to the new DC and running the repair on the new DC, you can just shut the old one off directly. There's nothing left on the old nodes that would be lost, so you can cut them off without a decommission. – Gordon Seidoh Worley Apr 14 '14 at 16:51
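
For reference, here is a rough sketch of the workaround described in the first comment above. The host placeholder, the service command, and the final `removenode` step are assumptions rather than part of the original report; adjust them for your environment and Cassandra version.

    # 1. Start the decommission and watch the outbound streams. The command may exit
    #    with the timeout error shown above, but streaming continues in the background.
    nodetool -h node-to-remove decommission
    watch -n 30 'nodetool -h node-to-remove netstats'

    # 2. Once no active streams remain, flush the node and stop it.
    nodetool -h node-to-remove drain
    sudo service cassandra stop          # run on the node being removed

    # 3. If the node still appears in `nodetool status`, tell the cluster it is gone.
    nodetool removenode <host-id-from-nodetool-status>

    # 4. Repair the remaining nodes to cover anything the interrupted decommission missed.
    nodetool repair -pr                  # run on each remaining node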

2 Answers


The hinted handoff manager checks for stored hints during the decommission so it can hand them off to other nodes rather than lose them. You most likely have a lot of hints, a lot of tombstones, or something else in that table causing the query to time out. Are you seeing any other exceptions in your logs before the timeout? Raising the read timeout on your nodes before you decommission them, or manually deleting the hints column family, should get you past this. If you delete the hints, make sure to run a full cluster repair once all of your decommissions are done, to propagate the data those hints would have delivered.
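
A minimal sketch of the two options above, assuming a roughly Cassandra 1.2-era setup; exact parameter names, defaults, and command availability vary by version, so treat this as illustrative rather than prescriptive.

    # Option 1: raise the read/range timeouts in cassandra.yaml on the node being
    # decommissioned (hints are read with a range slice query), then restart it.
    #   read_request_timeout_in_ms: 60000
    #   range_request_timeout_in_ms: 60000

    # Option 2: delete the stored hints instead of letting them be streamed.
    nodetool truncatehints               # on versions that provide this command
    #   TRUNCATE system.hints;           # alternative via cqlsh, where permitted

    # Then retry:
    nodetool decommission

    # Once all decommissions are finished, run a full cluster repair so any writes
    # the deleted hints covered are propagated:
    nodetool repair                      # on each node in the cluster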

Zanson

The short answer is that the node I was trying to decommission was underpowered for the amount of data it held. As of this writing, there seems to be a fairly hard minimum of resources needed to handle nodes with arbitrary amounts of data, somewhere in the neighborhood of what an AWS i2.2xlarge provides. In particular, the old m1 instances let you get into trouble by allowing you to store far more data on each node than its memory and compute resources can support.

Gordon Seidoh Worley
  • I'm currently using m1.xlarge's for my cluster. How much data on average did you have on each node, when you got into trouble? I'm worried about running into the same problem you did, and want to cap the total size of the data. – worker1138 Jul 14 '15 at 21:37