
datastax-enterprise

datastax-startup

We are using DataStax DSE Cluster.

We are trying to migrate a table to another table that has the same definition as the first, plus a secondary index.

The table has about 1.7M rows.

1) We first used the Cassandra COPY command from cqlsh. It ran for more than an hour and then timed out, so it didn't work.

2) We then wrote a program to export the first table to a CSV file. We split this CSV file into separate CSV files and tried to load them into the second table. The inserts take some time and then fail.

3) We are now looking into http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated

Since we already have the CSV files, is this the right approach?

We are using this library, https://github.com/yukim/cassandra-bulkload-example, to generate the SSTables.

Is this the right way to handle it?
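For reference, the split in step 2) can be sketched roughly as follows. This is a minimal illustration, not the actual program we ran: the function name `split_csv`, the `part-NNNNN.csv` file naming, and the assumption that the export has a header row are all hypothetical.

```python
import csv
from pathlib import Path


def split_csv(src, out_dir, rows_per_file=100_000):
    """Split src into numbered CSV files of at most rows_per_file data rows.

    Assumes the first row of src is a header; it is repeated at the top of
    every chunk. Returns the list of chunk paths in the order written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return paths
        out, writer, count = None, None, 0
        for row in reader:
            # Start a new chunk on the first row or when the current one is full.
            if writer is None or count >= rows_per_file:
                if out is not None:
                    out.close()
                path = out_dir / f"part-{len(paths):05d}.csv"
                out = open(path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                paths.append(path)
                count = 0
            writer.writerow(row)
            count += 1
        if out is not None:
            out.close()
    return paths
```

Smaller chunks make it easier to retry only the part that failed instead of the whole 1.7M-row file.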

1 Answer


If you have a CSV file, I would recommend using this bulk loader:

https://github.com/brianmhess/cassandra-loader

If you have Spark analytics enabled on your cluster:

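// Needed in a standalone application; harmless if your Spark shell already imports it.
import com.datastax.spark.connector._
sc.cassandraTable("ks1","table").saveToCassandra("ks2","table")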

See also:

http://docs.datastax.com/en/latest-dse/datastax_enterprise/migration/migratingBulkSparkRDD.html

Iain
  • Hi Iain, we do have the CSV file available. With this tool, I guess we first have to create the schema on the destination DB and then run the loading tool there. Is that right? Also, we are having trouble with the connection dropping from a remote host. Is it better to run the loading tool on the node where the main Cassandra node runs? – Darwin Ling Apr 12 '16 at 23:47
  • Also, why is this tool better? I have inspected the code, and it looks like it only uses a ConnectionPool. How is this better than sstableloader? – Darwin Ling Apr 12 '16 at 23:55
  • This slideshare compares sstableloader, COPY, and the cassandra-loader app: http://www.slideshare.net/BrianHess4/bulk-loading-into-cassandra The big advantage is not having to write SSTables. You can also tolerate node failures with the loader, but not with sstableloader. It's usually best to run clients on separate hosts. If your connection is dropping, you could try lowering the number of futures and/or the rate. – Iain Apr 13 '16 at 21:46
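
To illustrate what lowering the number of futures and the rate means in general, here is a minimal Python sketch of a loader loop that caps both the number of pending inserts and the submission rate. It is not cassandra-loader's code: `throttled_load`, `insert_row`, `max_in_flight`, and `max_rate` are hypothetical names, and `insert_row` stands in for whatever driver call actually writes one row.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def throttled_load(rows, insert_row, max_in_flight=100, max_rate=None, workers=8):
    """Call insert_row(row) for every row with bounded concurrency.

    At most max_in_flight calls are pending (queued or running) at any time
    and, if max_rate is set, roughly max_rate rows are submitted per second.
    Returns (number_succeeded, list_of_failed_rows).
    """
    slots = threading.BoundedSemaphore(max_in_flight)
    lock = threading.Lock()
    failed = []
    ok = 0
    interval = 1.0 / max_rate if max_rate else 0.0

    def run(row):
        nonlocal ok
        try:
            insert_row(row)  # hypothetical stand-in for the real write
            with lock:
                ok += 1
        except Exception:
            with lock:
                failed.append(row)  # keep failed rows so they can be retried
        finally:
            slots.release()  # free a slot for the next submission

    with ThreadPoolExecutor(max_workers=workers) as pool:
        next_time = time.monotonic()
        for row in rows:
            if interval:
                delay = next_time - time.monotonic()
                if delay > 0:
                    time.sleep(delay)
                next_time = max(next_time, time.monotonic()) + interval
            slots.acquire()  # blocks while max_in_flight calls are pending
            pool.submit(run, row)
    # Leaving the with-block waits for all submitted calls to finish.
    return ok, failed
```

Lowering `max_in_flight` reduces how many requests the cluster and the connection have to absorb at once, and `max_rate` smooths out bursts. The failed rows are returned rather than dropped, so a second pass can retry only those.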