
I've got a Cassandra 2.0.1 cluster of three nodes, with the main keyspace at replication factor 3. Because of an accidental misconfiguration that added an extra fourth node to the cluster, I first tried to fix it with an unnecessary "nodetool decommission" (run on node db2) before doing the right thing with "nodetool removenode ".

Now, node db2, where the decommission was run, sees one of the other nodes as having status "Down", even though the others think everything is up. Additionally, when I run "nodetool ring" on all nodes, db1 reports "Replicas: 2" at the top of the listing, whereas db2 and db3 report "Replicas: 3".

The keyspace contains data I don't want to lose, and the cluster can't be taken completely down because new data is being inserted all the time. What would be a good way to fix the situation without endangering the existing and new data?

Obfuscated nodetool status outputs below.

[db1 ~]# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  xx.xx.xx.99    30.38 MB   256     100.0%            cccccccc-cccc-cccc-cccc-cccccccccccc  rack1
UN  xx.xx.xx.122   28.93 MB   256     100.0%            aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa  rack1
UN  xx.xx.xx.123   29.59 MB   256     100.0%            bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb  rack1

[db2 ~]# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
DN  xx.xx.xx.122   28.93 MB   256     100.0%            aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa  rack1
UN  xx.xx.xx.99    30.38 MB   256     100.0%            cccccccc-cccc-cccc-cccc-cccccccccccc  rack1
UN  xx.xx.xx.123   29.59 MB   256     100.0%            bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb  rack1

[db3 ~]# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  xx.xx.xx.122   28.93 MB   256     100.0%            aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa  rack1
UN  xx.xx.xx.99    30.38 MB   256     100.0%            cccccccc-cccc-cccc-cccc-cccccccccccc  rack1
UN  xx.xx.xx.123   29.59 MB   256     100.0%            bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb  rack1
Eemeli Kantola
  • Looks like the problem got magically solved by itself. The problematic node experienced some (possibly unrelated) hard drive issues, had to be rebooted, and did replay a bunch of commit logs and maybe some other auto-repair stuff. But the reasons behind all this remain unclear, even though the symptoms are gone. – Eemeli Kantola Nov 18 '13 at 14:28
  • We are facing the same issue in production. We have 10 nodes in our cluster. When I run nodetool status, one node is shown DN, but when I run nodetool status again, the same node is shown UN. This problem is consistent. What can be the possible reason? I am sure that this problem is not due to a hard drive issue. – abi_pat Aug 13 '15 at 07:46

1 Answer


Aaron Morton described in detail how he debugged a similar problem. You should check on the state of gossip in your cluster.

  • Check the output of "nodetool gossipinfo"
  • Enable the following trace logging:

    log4j.logger.org.apache.cassandra.gms.Gossiper=TRACE
    log4j.logger.org.apache.cassandra.gms.FailureDetector=TRACE

Hopefully from that you can get a better idea what is going on in your cluster.
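As a quick way to compare the gossip view from each node, something like the following loop can help (a sketch only: db1/db2/db3 are placeholder hostnames standing in for your actual nodes, and it assumes you can ssh to each node and run nodetool there):

```shell
#!/bin/sh
# Compare how each node sees the rest of the cluster.
# db1, db2, db3 are placeholders for your real hostnames.
for host in db1 db2 db3; do
  echo "=== gossip view from $host ==="
  # Per-endpoint gossip state as seen from this node; the STATUS line
  # shows whether the node considers each peer NORMAL, LEFT, removed, etc.
  ssh "$host" nodetool gossipinfo | grep -E '^/|STATUS'
  # Cluster membership as seen from this node, for cross-checking.
  ssh "$host" nodetool status
done
```

If one node's gossipinfo still carries LEFT or removal state for a peer that the other nodes consider NORMAL, that points at stale gossip on that node rather than a genuinely down peer.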

psanford
  • Thanks, this gives some hints on how to move forward. But it didn't solve the problem yet, because my case is not exactly the same as Aaron's, which additionally involves Cassandra 1.x and is thus not in all ways applicable to 2.0. – Eemeli Kantola Nov 08 '13 at 08:35