I'm trying to figure out how to migrate data from one Cassandra cluster to another Cassandra cluster with a different ring size, say from a 5-node cluster to a 7-node cluster.

I started looking at sstable2json, since it creates a JSON file for the SSTable on a specific Cassandra node. My thought was to do this for a column family on each node in the ring. On a 5-node ring, this would give me 5 JSON files, one for the column family data that resides on each node.

Then I'd merge the JSON files into one file and use json2sstable to import it into a new cluster of, let's say, 7 nodes. I was hoping that Cassandra would then replicate and balance the data evenly across the nodes in the ring, but I just read that SSTables are immutable once written. So if I did what I just described, I'd end up with a ring that has all the data for my column family on one node.

So can anyone help me figure out the process for migrating data from one cluster to a different cluster of a different ring size?

Brian Tompsett - 汤莱恩
Turbo

4 Answers

9

Better: run bin/sstableloader on the sstables from the old ring to stream them to the new one.

Normally sstableloader is used in a sequence like this:

  1. Create sstables locally using SSTableWriter
  2. Use sstableloader to stream the data in the sstables to the right nodes (bin/sstableloader path-to-directory-full-of-sstables). The directory name is assumed to be the keyspace, which will be the case if you point it at an existing Cassandra data directory.

Since you're looking to stream data from an existing cluster A to a new cluster B, you can skip straight to running sstableloader against the data on each node in cluster A.

More details on using sstableloader in this blog post.
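
As a rough illustration of running the loader once per sstable directory, here is a minimal Python sketch that only builds the command lines; it does not execute anything. The directory path and host addresses are hypothetical. The `-d` option for naming nodes in the target cluster is an assumption that holds for later Cassandra releases; the invocation described above takes only the directory.

```python
import shlex


def build_loader_commands(sstable_dirs, target_hosts=None, loader="bin/sstableloader"):
    """Return one sstableloader command (as an argument list) per directory.

    sstable_dirs: directories full of sstables, one per keyspace (hypothetical paths).
    target_hosts: optional nodes in the new cluster; passed with -d, which is an
                  assumption that applies to later Cassandra releases only.
    """
    commands = []
    for directory in sstable_dirs:
        cmd = [loader]
        if target_hosts:
            cmd += ["-d", ",".join(target_hosts)]
        cmd.append(str(directory))
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Hypothetical data directory on one node of cluster A and two nodes of cluster B.
    dirs = ["/var/lib/cassandra/data/MyKeyspace"]
    hosts = ["10.0.0.1", "10.0.0.2"]
    for cmd in build_loader_commands(dirs, hosts):
        print(shlex.join(cmd))
```

You would repeat this on every node of cluster A, since each node holds only part of the data.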

jbellis
  • If I write straight to the sstable, will Cassandra replicate the data to the other nodes? Also, if I put all the data files from all nodes in the ring into one folder and run sstableloader on it, won't I end up with duplicate data, since the data was replicated across 3 nodes in the original cluster? – Turbo Jul 23 '11 at 04:14
  • Yes, you'll end up with duplicate data. Compaction will take care of that, so it's not a problem. Also: there is no need to "put all the data files into one folder," just do it in-place. (Edited to clarify.) – jbellis Jul 24 '11 at 04:19
  • Hey jbellis, thanks for the info. I think this is the route to take. To add more info about my scenario: I'm using Hadoop on EC2 to generate a data model and persist it in Cassandra, also on EC2. When the model is built, I'll be pulling the Cassandra data down to my network. I'll be creating a fairly large Cassandra ring on EC2 to get some scalability benefits when generating the model, but the destination ring in my network will be smaller, probably by half. So my plan is to pull the data files down from EC2 to my network and then import the data. More in the next comment... – Turbo Jul 25 '11 at 16:30
  • So if I'm going from a 13 node cluster on ec2 to a local 5 node cluster, and I want to use sstableloader, do I have to worry about the data getting replicated across my 5 nodes correctly? Or can I just randomly pick 2-3 nodes from the ec2 cluster and load their data on any local node, and cassandra will properly balance the data across the new cluster? I'm worried that I might pick 3 ec2 nodes that have all copies of replicated data, and load it on one local node. If compaction corrects duplicate copies of data, will I in effect have lost my replication of the data in the ring? – Turbo Jul 25 '11 at 16:40
  • Oh joy! After reading the code for sstableloader, I get now that it writes the data to the cluster (meaning all nodes), not just to one specific node like json2sstable. That was my confusion. This makes the task very easy! I don't have to worry about replication or anything. It looks like as it writes each piece of data, it treats it like any normal Cassandra write and does the replication for me. sweety! – Turbo Jul 25 '11 at 17:17
  • Took some work to figure out how exactly to use sstableloader, but got it going. Here is a write-up on how it works: http://geekswithblogs.net/johnsPerfBlog/archive/2011/07/26/how-to-use-cassandrs-sstableloader.aspx – Turbo Jul 26 '11 at 19:53
  • So - if I want to copy data from 3 nodes to a new cluster. Can I bundle all sstables together in one directory, or do I have to run sstableloader 3 times for each table I want to load? – polve Sep 10 '14 at 10:31
0

You could follow these steps:

  1. Join the 7 new nodes to the existing 5-node cluster, giving each node its own ring token. At this point you have a 12-node cluster.
  2. Remove the 5 original nodes from the cluster formed in step 1.
  3. Reassign the ring tokens of the 7 remaining nodes once the 5 original nodes are gone.
  4. Run repair on the 7-node cluster.

John
0

You don't need to use sstable2json. If you have the space you can:

  1. get all the sstables from all of the nodes on the old ring
  2. put them all together on each of the new servers (renaming any which have the same names)
  3. run nodetool cleanup on each node in the new ring and they will throw away the data that doesn't belong to them.
Zanson
  • Would this work if the two rings are of different sizes? Say the original ring is 12 nodes, and the new ring is 5 nodes? – Turbo Jul 23 '11 at 04:17
  • Yes. But the sstableloader script mentioned by @jbellis in his answer is better. Snapshot the current nodes, then run sstableloader from each of the snapshot dirs to the new cluster. – Zanson Aug 08 '11 at 04:22
-1

I would venture to say that this isn't as big of a problem as it may seem.

  1. Create your new ring and define the tokens for each node appropriately as per http://wiki.apache.org/cassandra/Operations#Token_selection
  2. Import data into the new ring.
  3. The ring will balance itself based on the tokens you have defined http://wiki.apache.org/cassandra/Operations#Import_.2BAC8_export
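
To make step 1 concrete, here is a minimal sketch of evenly spaced token selection. It assumes the RandomPartitioner, whose token space runs from 0 to 2**127, and one token per node; neither assumption is stated in this answer, and the function name is made up for illustration.

```python
def initial_tokens(node_count):
    """Return evenly spaced ring tokens, one per node.

    Assumes RandomPartitioner (token space 0 .. 2**127) and a single
    token per node; both are assumptions, not something stated above.
    """
    token_space = 2 ** 127
    return [i * token_space // node_count for i in range(node_count)]


if __name__ == "__main__":
    # Print tokens for the 7-node target ring from the question.
    for node, token in enumerate(initial_tokens(7)):
        print(f"node {node}: initial_token = {token}")
```

The old 5-node ring and the new 7-node ring get different token sets, which is why data files cannot simply be copied node for node between them.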
sdolgy
  • Two questions. When you say import data into the new ring, specifically how do I do that? What tools? Does it matter if the new ring is a different size than the original ring? – Turbo Jul 23 '11 at 04:15
  • The links do not work anymore. The second point is vague and useless. – ftrujillo Aug 22 '16 at 07:47