Completely unbalanced DC after bootstrapping new node

Question

I've just added a new node new into my Cassandra DC. Previously, my topology is as follows:

DC Cassandra: 1 node
DC Solr: 5 nodes

When I bootstrapped a 2nd node for the Cassandra DC, I noticed that the total bytes to be streamed is almost as big as the load of the existing node (916gb to stream; load of existing cassandra node is 956gb). Nevertheless, I allowed the bootstrap to proceed. It completed a few hours ago and now my fear is confirmed: the Cassandra DC is completely unbalanced.

Nodetool status shows the following:

Datacenter: Solr
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                        Load       Owns (effective)  Host ID                               Token                                    Rack
UN  solr node4                                     322.9 GB   40.3%             30f411c3-7419-4786-97ad-395dfc379b40  -8998044611302986942                     rack1
UN  solr node3                                     233.16 GB  39.7%             c7db42c6-c5ae-439e-ab8d-c04b200fffc5  -9145710677669796544                     rack1
UN  solr node5                                     252.42 GB  41.6%             2d3dfa16-a294-48cc-ae3e-d4b99fbc947c  -9004172260145053237                     rack1
UN  solr node2                                     245.97 GB  40.5%             7dbbcc88-aabc-4cf4-a942-08e1aa325300  -9176431489687825236                     rack1
UN  solr node1                                     402.33 GB  38.0%             12976524-b834-473e-9bcc-5f9be74a5d2d  -9197342581446818188                     rack1
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                        Load       Owns (effective)  Host ID                               Token                                    Rack
UN  cs node2                                       705.58 GB  99.4%             fa55e0bb-e460-4dc1-ac7a-f71dd00f5380  -9114885310887105386                     rack1
UN  cs node1                                      1013.52 GB  0.6%              6ab7062e-47fe-45f7-98e8-3ee8e1f742a4  -3083852333946106000                     rack1

Notice the 'Owns' column in the Cassandra DC: node2 owns 99.4% while node1 owns 0.6% (despite node2 having smaller 'Load' than node1). I expect them to own 50% each but this is what I got. I don't know what caused this. What I can remember is that I'm running a full repair in Solr node1 when I started the bootstrap of the new node. The repair is still running as of this moment (I think it actually restarted when the new node finished bootstrapping)

How do I fix this? (repair?)

Is it safe to bulk-load new data while the Cassandra DC is in this state?

Some additional info:

DSE 4.0.3 (Cassandra 2.0.7)
NetworkTopologyStrategy
RF1 in Cassandra DC; RF2 in Solr DC
DC auto-assigned by DSE
Vnodes enabled
Config of new node is modeled after the config of the existing node; so more or less it is correct

EDIT:

Turns out that I can't run cleanup too in cs-node1. I'm getting the following exception:

Exception in thread "main" java.lang.AssertionError: [SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-18509-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-18512-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38320-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38325-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38329-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38322-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38330-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38331-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38321-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38323-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38344-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38345-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38349-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38348-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38346-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-13913-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-13915-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38389-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-39845-Data.db'), SSTableReader(path='/home/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-38390-Data.db')]
    at org.apache.cassandra.db.ColumnFamilyStore$13.call(ColumnFamilyStore.java:2115)
    at org.apache.cassandra.db.ColumnFamilyStore$13.call(ColumnFamilyStore.java:2112)
    at org.apache.cassandra.db.ColumnFamilyStore.runWithCompactionsDisabled(ColumnFamilyStore.java:2094)
    at org.apache.cassandra.db.ColumnFamilyStore.markAllCompacting(ColumnFamilyStore.java:2125)
    at org.apache.cassandra.db.compaction.CompactionManager.performAllSSTableOperation(CompactionManager.java:214)
    at org.apache.cassandra.db.compaction.CompactionManager.performCleanup(CompactionManager.java:265)
    at org.apache.cassandra.db.ColumnFamilyStore.forceCleanup(ColumnFamilyStore.java:1105)
    at org.apache.cassandra.service.StorageService.forceKeyspaceCleanup(StorageService.java:2220)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:75)
    at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:279)
    at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
    at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
    at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
    at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
    at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
    at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1420)
    at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
    at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
    at sun.rmi.transport.Transport$1.run(Transport.java:177)
    at sun.rmi.transport.Transport$1.run(Transport.java:174)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:173)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:556)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:811)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:670)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

EDIT:

Nodetool status output (without keyspace)

Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: Solr
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                        Load       Owns   Host ID                               Token                                    Rack
UN  solr node4                                     323.78 GB  17.1%  30f411c3-7419-4786-97ad-395dfc379b40  -8998044611302986942                     rack1
UN  solr node3                                     236.69 GB  17.3%  c7db42c6-c5ae-439e-ab8d-c04b200fffc5  -9145710677669796544                     rack1
UN  solr node5                                     256.06 GB  16.2%  2d3dfa16-a294-48cc-ae3e-d4b99fbc947c  -9004172260145053237                     rack1
UN  solr node2                                     246.59 GB  18.3%  7dbbcc88-aabc-4cf4-a942-08e1aa325300  -9176431489687825236                     rack1
UN  solr node1                                     411.25 GB  13.9%  12976524-b834-473e-9bcc-5f9be74a5d2d  -9197342581446818188                     rack1
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                        Load       Owns   Host ID                               Token                                    Rack
UN  cs node2                                       709.64 GB  17.2%  fa55e0bb-e460-4dc1-ac7a-f71dd00f5380  -9114885310887105386                     rack1
UN  cs node1                                      1003.71 GB  0.1%   6ab7062e-47fe-45f7-98e8-3ee8e1f742a4  -3083852333946106000                     rack1

Cassandra yaml from node1: https://www.dropbox.com/s/ptgzp5lfmdaeq8d/cassandra.yaml (only difference with node2 is listen_address and commitlog_directory)

Regarding CASSANDRA-6774, it's a bit different because I didn't stop a previous cleanup. Although I think I took a wrong route now by starting a scrub (still in-progress) instead of restarting the node first just like their suggested workaround.

UPDATE (2014/04/19):

nodetool cleanup still fails with an assertion error after doing the following:

Full scrub of the keyspace
Full cluster restart

I'm now doing a full repair of the keyspace in cs-node1

UPDATE (2014/04/20):

Any attempt to repair the main keyspace in cs-node1 fails with:

Lost notification. You should check server log for repair status of keyspace

I also saw this just now (output of dsetool ring)

Note: Ownership information does not include topology, please specify a keyspace.
Address          DC           Rack         Workload         Status  State    Load             Owns                 VNodes
solr-node1       Solr         rack1        Search           Up      Normal   447 GB           13.86%               256
solr-node2       Solr         rack1        Search           Up      Normal   267.52 GB        18.30%               256
solr-node3       Solr         rack1        Search           Up      Normal   262.16 GB        17.29%               256
cs-node2         Cassandra    rack1        Cassandra        Up      Normal   808.61 GB        17.21%               256
solr-node5       Solr         rack1        Search           Up      Normal   296.14 GB        16.21%               256
solr-node4       Solr         rack1        Search           Up      Normal   340.53 GB        17.07%               256
cd-node1         Cassandra    rack1        Cassandra        Up      Normal   896.68 GB        0.06%                256
Warning:  Node cs-node2 is serving 270.56 times the token space of node cs-node1, which means it will be using 270.56 times more disk space and network bandwidth. If this is unintentional, check out http://wiki.apache.org/cassandra/Operations#Ring_management
Warning:  Node solr-node2 is serving 1.32 times the token space of node solr-node1, which means it will be using 1.32 times more disk space and network bandwidth. If this is unintentional, check out http://wiki.apache.org/cassandra/Operations#Ring_management

Keyspace-aware:

Address          DC           Rack         Workload         Status  State    Load             Effective-Ownership  VNodes
solr-node1       Solr         rack1        Search           Up      Normal   447 GB           38.00%               256
solr-node2       Solr         rack1        Search           Up      Normal   267.52 GB        40.47%               256
solr-node3       Solr         rack1        Search           Up      Normal   262.16 GB        39.66%               256
cs-node2         Cassandra    rack1        Cassandra        Up      Normal   808.61 GB        99.39%               256
solr-node5       Solr         rack1        Search           Up      Normal   296.14 GB        41.59%               256
solr-node4       Solr         rack1        Search           Up      Normal   340.53 GB        40.28%               256
cs-node1         Cassandra    rack1        Cassandra        Up      Normal   896.68 GB        0.61%                256
Warning:  Node cd-node2 is serving 162.99 times the token space of node cs-node1, which means it will be using 162.99 times more disk space and network bandwidth. If this is unintentional, check out http://wiki.apache.org/cassandra/Operations#Ring_management

This is a strong indicator that something is wrong with the way cs-node2 bootstrapped (as I described at the beginning of my post).

for your exception there seems to be a JIRA issue logged for Cassandra: https://issues.apache.org/jira/browse/CASSANDRA-6774 — Roman Tumaykin, May 16 '14 at 06:29
For your original question, can you please post your cassandra.yaml files from both Cassandra nodes? Also nodetool status output (without keyspace) — Roman Tumaykin, May 16 '14 at 06:30
Just to check, it looks like you are using the Mumur3 partitioner and the DSEdelegate snitch, what snitch is configured in your dse.yaml? — markc, May 15 '15 at 11:44

score 0 · Answer 1 · answered Nov 22 '15 at 20:30

It looks like your issue is that you most likely switch from single tokens to vnodes on your existing nodes. So all of their tokens are in a row. This is actually not possible to do in current Cassandra versions because it was too hard to get right.

The only real way to fix it and be able to add a new node is to decommission the first new node you added, then follow the current documentation on switching to vnodes from single nodes, which is basically that you need to make brand new data centers with brand new vnodes using nodes in them and then decommission the existing nodes.

Completely unbalanced DC after bootstrapping new node

1 Answers1