
I tried removing a node from a cluster by issuing `nodetool decommission`, and watched `nodetool netstats` to see how much data was being streamed to the other nodes, which all looked fine.
After the node had been decommissioned, running `nodetool status` on some of the remaining nodes (not the one I decommissioned) shows a few nodes with status 'DN', while the rest show 'UN'.
I'm quite confused about why the nodes are reporting this inconsistent state, and why the output is not the same on all nodes after decommissioning the node.
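For reference, this is roughly the sequence of commands involved (exact hosts omitted):

```
# On the node being removed:
nodetool decommission

# On the same node (separate terminal), to watch data streaming to the other nodes:
nodetool netstats

# Afterwards, on several of the remaining nodes:
nodetool status
```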

Am I missing any steps before or after the decommission?
Any comments/help would be highly appreciated!

Avis
  • Was the node up and running normally before you decommissioned it? Do the Cassandra logs show the node as unreachable? And also, could you give a brief description of the configuration you use, like the data centers and replication factor – Shoban Sundar May 14 '18 at 05:26
  • Was your node decommissioned successfully, or can you still see it in the `nodetool status` output? – Payal May 14 '18 at 06:22
  • Yes Payal, the node was running fine and the whole cluster was healthy when I initiated the decommission. – Avis May 14 '18 at 06:32
  • @ShobanSundar: The whole cluster was healthy when I ran it. There is only one DC, the replication factor is 3, and there are 18 nodes in the cluster. I keep seeing the following message in the logs of the nodes that show DN: `WARN [GossipTasks:1] 2018-05-14 07:01:48,127 Gossiper.java:764 - Gossip stage has 708 pending tasks; skipping status check (no nodes will be marked down)` – Avis May 14 '18 at 07:03
  • @ShobanSundar: Does it take some time for the state to sync after decommissioning the node? I see a lot of those gossip messages on almost 8 nodes. Please shed some light on what can be done. – Avis May 14 '18 at 07:29
  • Your gossiper seems stuck if it has 700 pending tasks. Do you get this warning on more than one node? – Simon Fontana Oscarsson May 14 '18 at 07:49
  • @ShobanSundar: Looks like a few nodes are down. I see the following message in system.log: `INFO [GossipTasks:1] 2018-05-14 06:56:42,615 Gossiper.java:1019 - InetAddress /10.10.10.1 is now DOWN` -- not sure why that is. Any idea? – Avis May 14 '18 at 07:51
  • @SimonFontanaOscarsson: Yes, I see a few nodes down, but that keeps happening when I remove nodes. How do I get rid of these node failures when I decommission a node? Please help me out. – Avis May 14 '18 at 07:53
  • @SimonFontanaOscarsson: I see those gossip messages on those nodes, but when I run `nodetool status` on them, it shows the status as 'UN'. – Avis May 14 '18 at 08:00
  • What version are you running? – Simon Fontana Oscarsson May 14 '18 at 09:04
  • Please can you add to your question an example of what you see, along with the output of `nodetool describecluster` from one of the healthy nodes. Also please share the Cassandra version. – markc May 14 '18 at 09:12
  • @SimonFontanaOscarsson, I am running Cassandra 3.0.13 – Avis May 15 '18 at 06:19
  • @markc, here is the output of `nodetool describecluster`: Name: testcluster Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch Partitioner: org.apache.cassandra.dht.Murmur3Partitioner Schema versions: 61c4f06d-5fde-306b-bbee-e09f89e9d7d9: [10.0.10.1, 10.0.10.2, 10.0.10.3, 10.0.10.4, 10.0.10.5, 10.0.10.6, 10.0.10.7, 10.0.10.8, 10.0.10.9, 10.0.10.10] – Avis May 15 '18 at 06:24
  • @markc, the problem I see after decommissioning a node from the cluster is that `nodetool status` doesn't report a consistent state: it shows 'UN' on some nodes and 'DN' on others. (FYI: I was able to execute a query against the 'DN' nodes though.) Let me know if that answers what you're looking for. – Avis May 15 '18 at 06:32
  • @Avis - it's quite difficult to diagnose this from what we have here. Perhaps if you are unable to put details of your cluster here, then share the info via pastebin or GitHub gists? The cluster schema seems to be in agreement, so that's good. The log snippet you have about pending gossip tasks suggests the node is under heavy load. Post some details like `nodetool status` and the `system.log` from the node you are trying to decommission; that will give us more to go on. – markc May 15 '18 at 12:37

1 Answer


If the gossip information is not the same on all nodes, then you should do a rolling restart of the cluster. That will reset gossip on all nodes.
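A minimal sketch of what such a rolling restart could look like, assuming SSH access, a systemd-managed Cassandra service, and remote JMX access for `nodetool` (host names are placeholders, adapt to your deployment):

```bash
#!/usr/bin/env bash
# Rolling-restart sketch: restart one node at a time and wait for it to come
# back before moving on. Host names, service name and remote JMX access are
# assumptions -- adjust to your own setup.
for host in node1 node2 node3; do
    # Flush memtables and stop serving requests cleanly, then restart Cassandra.
    ssh "$host" 'nodetool drain && sudo systemctl restart cassandra'

    # Wait until the restarted node reports gossip as active again.
    until nodetool -h "$host" info 2>/dev/null | grep -q 'Gossip active.*true'; do
        sleep 10
    done
done
```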

Was the node you removed a seed node? If it was, don't forget to remove its IP from the seed list in cassandra.yaml on all nodes.
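For reference, the seed list lives under `seed_provider` in cassandra.yaml; a sketch with placeholder addresses, assuming the default SimpleSeedProvider:

```yaml
# cassandra.yaml (excerpt) -- IPs are placeholders
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      # Remove the decommissioned node's address from this list on every
      # node, then restart each node so the change takes effect.
      - seeds: "10.0.10.2,10.0.10.3"
```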

Pedro Gordo
  • Hi Pedro, it was not a seed node, and the gossip information is the same on all nodes. – Avis May 21 '18 at 15:10
  • What makes the gossip information go wrong on other nodes? Is there anything to take care of beforehand, or does it just take time to sync across nodes? – Avis May 24 '18 at 06:46
  • @Avis without knowing the whole context it's difficult to say what might have led to that in your case, but it's probably due to packets being lost on the network, which leads to gossip inconsistency. Yes, I know, "always blame the network"... :) – Pedro Gordo Jul 02 '18 at 09:20