
I am trying to upload a CSV data file to a Cassandra cluster. This needs to be a continuous process, so I am creating a simple Java app that reads the CSV file, converts it to an SSTable, and then uploads it to the Cassandra cluster.

I was able to achieve the first step using CQLSSTableWriter and create a local SSTable. From what I have searched, I understand that I have to use the BulkLoader provided by apache.cassandra.tools to upload the SSTable to the cluster, but I couldn't figure out that part. Also, my SSTable copy will be local, not on the server where the cluster is running. Can someone help me on how to achieve this, with an example if possible? That would really help.

To add: my actual use case is to continuously archive data from Sybase to Cassandra, for which I am trying to create a CSV of the Sybase data and upload that to Cassandra, as the data will be in the millions of rows. Any other approaches are also welcome here.

Erick Ramirez

1 Answer


The bulk loader you're referring to is the sstableloader utility, available in the tools/bin/ directory of your Cassandra installation. The sstableloader utility streams the contents of SSTables to a live Cassandra cluster.
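For reference, a minimal invocation looks like the sketch below. The host addresses, keyspace, and table names are hypothetical placeholders; the directory argument must follow the keyspace/table layout that CQLSSTableWriter produces, and the machine running the command must be able to reach the cluster.

```shell
# Stream locally generated SSTables to the cluster.
# -d takes a comma-separated list of initial contact points.
# The final argument is the directory holding the SSTable files,
# which must be laid out as .../<keyspace>/<table>/
tools/bin/sstableloader \
  -d 10.0.0.1,10.0.0.2 \
  /path/to/my_keyspace/my_table
```

Because sstableloader only streams data, it can run from any machine with network access to the cluster; the SSTables do not need to be on a cluster node.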

However, your approach is inefficient because it's not necessary to convert the CSV data into SSTables.

The DataStax Bulk Loader tool (DSBulk) was written specifically for this purpose. It allows you to bulk load data in CSV or JSON format to Cassandra. You can also use DSBulk to export data from Cassandra to CSV or JSON.
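As a sketch, loading a CSV file with DSBulk can be as simple as the command below. The file path, contact point, keyspace, and table names here are hypothetical; substitute your own values, and add authentication/SSL options as your cluster requires.

```shell
# Load a CSV file (with a header row) directly into a table.
dsbulk load \
  -url /path/to/data.csv \
  -h '10.0.0.1' \
  -k my_keyspace \
  -t my_table \
  -header true
```

With `-header true`, DSBulk maps CSV columns to table columns by name, so no explicit schema mapping is needed when the header matches the table definition.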

The DSBulk documentation and accompanying blog posts include examples to help you get started quickly.

DSBulk is fully open-source, so it's free to use. Cheers!

Erick Ramirez
  • Thank you Erick for the response. I have tried using the DSBulk tool as well, and I am facing connection errors while running it from the terminal. I have added a comment in DataStax on an old thread. Please find the details below. – harish bollina Feb 28 '23 at 12:35
  • dsbulk { # Example to set connector name: # connector.name = csv --dsbulk.connector.name = csv --dsbulk.connector.csv.url = --dsbulk.connector.csv.header true --driver.basic.contact-points = [ "169.XX.XXX.XX", "169.XX.XXX.XX", "169.XX.XXX.XX" ] --driver.advanced.auth-provider.username = --driver.advanced.auth-provider.password = --schema.keyspace = --schema.table = --driver.advanced.ssl-engine-factory.keystore-path = --driver.advanced.ssl-engine-factory.keystore-password = } – harish bollina Feb 28 '23 at 12:41
  • I have used the same settings above from IntelliJ to connect to Cassandra and was able to connect; however, sslenable=true was added as an option from IntelliJ. My use case is to archive millions of records from Sybase to Cassandra, which will be a continuous process, and I am planning on creating a simple Java app that calls this dsbulk script. Please let me know if any further info is needed. Sorry for adding multiple comments, but I wanted to share everything I tried. – harish bollina Feb 28 '23 at 12:41
  • I have updated the dsbulk config to use DataStax driver properties like below: dsbulk { --dsbulk.connector.name = csv --dsbulk.connector.csv.url = --dsbulk.connector.csv.header true --datastax-java-driver.basic.contact-points = [ "169.XX.XXX.XX" ] --datastax-java-driver.advanced.a..name = --datastax-java-driver.advanced.a..word = --dsbulk.schema.keyspace = --dsbulk.schema.table = --da..ax-java-d..er.a...d.ssl-engine-factory.tru..re-path = --da..ax-java-d..er.a..d.ssl-engine-factory.tru..re-password = } – harish bollina Feb 28 '23 at 16:50
  • The above dsbulk config is giving the errors `Could not reach any contact point, make sure you've provided valid addresses` and `An existing connection was forcibly closed by the remote host`. I have used the same connection properties from IntelliJ to connect to Cassandra and it works fine. Can you please help me figure out what the problem could be here? – harish bollina Feb 28 '23 at 16:54
  • Will you please post a new question with details of your issue? Otherwise, the comments will be never ending if we keep answering new questions in threads. Cheers! – Erick Ramirez Mar 01 '23 at 00:54
  • Accepting the answer as I believe dsbulk is the only feasible option for my use case. A follow-up question was raised under https://stackoverflow.com/questions/75598991/dsbulk-failing-to-connect-to-remote-cluster-to-load-csv-data – harish bollina Mar 01 '23 at 02:17