
I have configured a Cassandra cluster with 3 nodes:

Node1 (192.168.0.2), Node2 (192.168.0.3), Node3 (192.168.0.4)

https://i.stack.imgur.com/vXqsi.png

I created a keyspace 'test' with a replication factor of 2:

CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
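
To verify the setup, cqlsh can show both the keyspace's replication settings and the consistency level it will use for queries (a minimal sketch, assuming cqlsh access to any node):

-- Prints the full CREATE statement, including the replication map
DESCRIBE KEYSPACE test;

-- With no argument, prints the consistency level cqlsh is currently using
CONSISTENCY;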

When I stop either Node2 or Node3 (one at a time, or both at once), I am able to perform CRUD operations on the keyspace's table.

When I stop Node1 and try to update/create a row from Node2 or Node3, I get the following error, although Node2 and Node3 are up and running:

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /192.168.0.4:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)))

I am not sure how Cassandra elects a leader if a leader node dies.

UAnand
  • There is no concept of a leader in Cassandra... check if you can telnet to the host (192.168.0.4) on port 9042 – undefined_variable Feb 16 '17 at 06:30
  • Could you provide more information about the consistency level used on queries (this has a huge impact on the behavior you are expecting)? Are you using a driver or accessing via cqlsh? – Arthur Landim Feb 16 '17 at 12:00
  • @undefined_variable Yes, I am able to telnet from my local desktop to all the nodes on port 9042. – UAnand Feb 16 '17 at 16:13
  • @ArthurLandim I am using DBeaver Enterprise and connecting to the nodes via Cassandra CQL to execute my queries. – UAnand Feb 16 '17 at 16:15
  • @ArthurLandim The queries are listed below: CREATE KEYSPACE test WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 2}; CREATE TABLE test.emp (emp_id int PRIMARY KEY, emp_name text, emp_city text, emp_sal varint, emp_phone varint); INSERT INTO test.emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) VALUES (11, 'John', 'Fort Worth', 434333333, 150000); – UAnand Feb 16 '17 at 16:25

2 Answers


So, you are using replication_factor 2, which means only 2 nodes will hold a replica of your keyspace's data (not all 3 nodes).

You can use this calculator to find out whether your setup has decent consistency. In your case the result is: "You can survive the loss of no nodes without impacting the application."
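
To make the arithmetic concrete (a sketch of the standard quorum formula, not anything specific to your cluster):

-- QUORUM = floor(replication_factor / 2) + 1
-- RF = 2  ->  QUORUM = 2  (every replica must be up; no node loss is tolerated)
-- RF = 3  ->  QUORUM = 2  (one replica can be down and QUORUM still succeeds)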

Arthur Landim
  • I changed the RF to 3 and tried again with Node1 stopped and Node2, Node3 up, but it still gives me the error below (Cannot achieve consistency level QUORUM). How do I achieve consistency level QUORUM? Is there a specific way to do so? All host(s) tried for query failed (tried: /192.168.0.3:9042 (com.datastax.driver.core.exceptions.ServerError: An unexpected error occurred server side on /192.168.0.3:9042: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level QUORUM) – UAnand Feb 16 '17 at 20:06
  • The Cassandra version I am using is 2.1.8, and I have set the consistency level to QUORUM. – UAnand Feb 16 '17 at 21:27

I think I wasn't clear in my response. The replication factor is about how many copies of your data will exist. The consistency level is how many copies your client will wait to be written before getting a response from the server. For example: all your nodes are up and the client issues a CQL statement with CL QUORUM; the server will write the data to 2 nodes (3/2 + 1 = 2, using integer division) and reply to the client, and in the background it will copy the data to the third node as well.
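
To illustrate (a minimal cqlsh sketch against the test.emp table from the question; the row values here are made up):

CONSISTENCY QUORUM;

-- With RF = 3, the coordinator waits for 2 of the 3 replicas to acknowledge
-- this write before replying; the third replica is updated in the background.
INSERT INTO test.emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)
VALUES (12, 'Jane', 'Dallas', 434333334, 160000);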

In your example, if you shut down 2 nodes of a 3-node cluster you will never achieve a QUORUM for requests (with CL QUORUM), so you have to use consistency level ONE; once the nodes are up again, Cassandra will copy the data onto them. One thing that can happen is: before Cassandra has copied the data onto the other 2 nodes, the client makes a request against node1 or node2 and the data is not there yet.
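
For example, with only a single replica reachable, dropping to CL ONE keeps the keyspace usable (a sketch; the staleness caveat from the paragraph above applies):

CONSISTENCY ONE;

-- Succeeds as long as one replica is up, but may return stale data if that
-- replica has not yet received the latest write.
SELECT * FROM test.emp WHERE emp_id = 11;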

Arthur Landim
  • Thanks, the issue is resolved. I changed the consistency level to QUORUM and the replication factor to 3, and in cassandra.yaml I commented out num_tokens and generated an initial_token for each of the 3 nodes (see the sketch after this comment for what that change looks like). With these changes, the cluster is working fine. So in a cluster of 3 nodes, only one node can be down at any point in time; 2 nodes should always be up for high availability. – UAnand Feb 21 '17 at 17:47
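
For readers who want to reproduce that last change, a sketch of the cassandra.yaml edit described in the comment above. The token values are the evenly spaced Murmur3 tokens for a 3-node ring (token_i = i * 2^64/3 - 2^63); they assume the default Murmur3Partitioner, so adapt them to your partitioner:

# cassandra.yaml on Node1; repeat on each node with its own token
# num_tokens: 256                  <- commented out: one manual token per node
initial_token: -9223372036854775808
# Node2 would use: initial_token: -3074457345618258603
# Node3 would use: initial_token: 3074457345618258602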