Questions tagged [dsbulk]

DataStax Bulk Loader (DSBulk) is a unified open-source tool for loading into and unloading from Cassandra-compatible storage engines such as Apache Cassandra®, DataStax Astra, and DataStax Enterprise (DSE).

Out of the box, DSBulk provides the ability to:

  • Load (import) large amounts of data into the database efficiently and reliably;
  • Unload (export) large amounts of data from the database efficiently and reliably;
  • Count elements in a database table: how many rows in total, how many rows per replica and per token range, and how many rows in the top N largest partitions.
  • Currently, CSV and JSON formats are supported for both loading and unloading data; see the example commands below.
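
For a concrete feel, minimal example invocations might look like the following; keyspace, table, and file names here are placeholders, and option names should be checked against the dsbulk help for your version:

    # Load a CSV file (with a header row) into keyspace ks1, table table1
    dsbulk load -url /path/to/data.csv -k ks1 -t table1 -header true

    # Unload (export) the same table to a directory of CSV files
    dsbulk unload -k ks1 -t table1 -url /path/to/export_dir

    # Count the rows in the table
    dsbulk count -k ks1 -t table1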

GitHub: https://github.com/datastax/dsbulk

41 questions
3 votes, 1 answer

Is it possible to back up and restore a Cassandra cluster using dsbulk?

I searched the internet a lot and saw many ways to back up and restore a Cassandra cluster, such as nodetool snapshot and Medusa. But my question is: can I use dsbulk to back up a Cassandra cluster? What are its limitations? Why…
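
For context, what people usually mean here is a logical, per-table export and re-import rather than a file-level snapshot; by default such a round trip does not carry over per-cell metadata like writetimes and TTLs. A minimal sketch, with placeholder keyspace, table, and paths:

    # "Backup": export one table to CSV files
    dsbulk unload -k shop -t orders -url /backups/shop_orders

    # "Restore": re-import the exported files into the (re-created) table
    dsbulk load -k shop -t orders -url /backups/shop_orders
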
3 votes, 0 answers

Loading json data into Cassandra using dsbulk

I feel like the documentation on loading JSON files into Cassandra is really lacking in the dsbulk docs. Here is part of the JSON file that I'm trying to load: [ { "tags": [ "r" ], "owner": { "reputation": 23, "user_id":…
Itzblend
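
For reference, the JSON connector is selected with -c json; a minimal sketch, assuming the field names in each document match the table's column names (keyspace and table below are placeholders):

    # Load JSON documents into keyspace ks1, table users using the JSON connector
    # (if the whole file is a single JSON array rather than one document per line,
    #  check the connector.json mode setting in the dsbulk docs)
    dsbulk load -url /path/to/data.json -k ks1 -t users -c json
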
3 votes, 2 answers

Issue while loading data into Cassandra using dsbulk

I'm facing an issue while loading data into a table from a .csv file using dsbulk. I get the following in the error log: Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [/10.0.126.13:9042] Timed out waiting for server response This…
v parkar
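
Timeouts like this usually mean the cluster cannot keep up with dsbulk's default ingest rate; one common mitigation is to throttle the load. A minimal sketch (the rate is an arbitrary example value, and file/keyspace/table names are placeholders):

    # Throttle dsbulk to at most 1000 requests per second and retry the load
    dsbulk load -url /path/to/data.csv -k ks1 -t table1 --executor.maxPerSecond 1000
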
2 votes, 1 answer

DSBulk loader version 1.8: error in loading and connecting to Apache Cassandra

I installed Apache Cassandra and the DSBulk loader following the manual and everything looks OK, but when I try to load data with DSBulk there seems to be a connection problem between the database and DSBulk. Can someone tell me what happened and how I can solve it? That's…
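
When DSBulk cannot reach the node, the first things to check are the contact point and port; by default it tries 127.0.0.1:9042. A minimal sketch with a placeholder address and credentials:

    # Point dsbulk explicitly at the node's listen address (plus credentials, if auth is enabled)
    dsbulk load -url /path/to/file.csv -k keyspace_test -t table_test -h 10.0.0.5 -u cassandra -p cassandra
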
2 votes, 2 answers

How to load data into Apache Cassandra with Datastax Bulk loader (Ubuntu)?

When I want to upload data into my "Test Cluster" in Apache Cassandra, I open the terminal and then: export PATH=/home/mypc/dsbulk-1.7.0/bin:$PATH source ~/.bashrc dsbulk load -url /home/mypc/Desktop/test/file.csv -k keyspace_test -t…
wundolab
2 votes, 1 answer

Use dsbulk load in Python

I created a Cassandra database in DataStax Astra. I'm able to connect to it in Python (using the cassandra-driver module and the secure_connect_bundle). I wrote a few APIs in my Python application to query the database. I read that I can upload CSV to…
F.S.
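
For Astra specifically, dsbulk takes the secure connect bundle via -b, so a common pattern is for the Python code to shell out to a command like the following (bundle path, keyspace, table, client ID, and client secret are placeholders):

    # Load a CSV into an Astra database using the secure connect bundle
    dsbulk load -url /path/to/data.csv -k my_keyspace -t my_table \
      -b /path/to/secure-connect-mydb.zip -u my_client_id -p my_client_secret
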
2 votes, 1 answer

DSBulk with ScyllaDB

I'm trying to use DSBulk to load data into ScyllaDB. I know that officially DSBulk doesn't support Scylla, but I found a post from someone using it instead of cqlsh. When I try to connect, I always get this error: init query OPTIONS: error…
JonathanChaput
1 vote, 3 answers

Why do row counts per node differ for a 5-node cluster with a replication factor of 3?

I have 5 machines connected as nodes in a Cassandra distributed data system. I have set the replication factor to 3. I have understood that with a replication factor of 3, the data will be spread across 3 nodes based on the coordinator node's availability.…
prasanna
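
Related to this, dsbulk count can show how rows are distributed across the cluster; a minimal sketch, assuming the --stats.modes option of dsbulk count (keyspace and table are placeholders):

    # Report how many rows each node is hosting
    dsbulk count -k ks1 -t table1 --stats.modes hosts
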
1 vote, 2 answers

Is a CQL COUNT() on a single partition also an expensive operation?

I know Cassandra's count() is an expensive operation as it needs a complete table scan. https://www.datastax.com/blog/running-count-expensive-cassandra But let's say we have a table hotel with hotel_type as the partition key and we run the query select…
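
For comparison, a count restricted to a single partition only touches the replicas that own that partition key rather than scanning the whole table (it can still be slow for a very large partition). A minimal sketch with hypothetical keyspace, table, and value:

    # Count rows within one partition of the hotel table
    cqlsh -e "SELECT COUNT(*) FROM my_keyspace.hotel WHERE hotel_type = 'resort';"
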
1 vote, 1 answer

What is the correct CSV format for tuples when loading data with DSBulk?

I recently started using Cassandra for my new project and am doing some load testing. I have a scenario where I'm doing a dsbulk load using CSV like this: $ dsbulk load -url -k -t -h -u -p -header…
Senthil
1 vote, 1 answer

Cassandra dsbulk mapping failed

I am using dsbulk to load a dataset into DataStax Astra. Error message: my table structure: CREATE TABLE project( FL_DATE date, OP_CARRIER text, DEP_DELAY float, ARR_DELAY float, PRIMARY KEY ((FL_DATE), OP_CARRIER) ) WITH CLUSTERING ORDER…
james kam
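
Mapping failures often come down to the CSV header fields not lining up with the table's columns (or their types); dsbulk's -m option maps fields to columns explicitly. A minimal sketch, assuming hypothetical header names on the left and the question's columns on the right:

    # Map CSV header fields (left) to table columns (right) explicitly
    dsbulk load -url /path/to/flights.csv -k my_keyspace -t project -header true \
      -m "FL_DATE = fl_date, OP_CARRIER = op_carrier, DEP_DELAY = dep_delay, ARR_DELAY = arr_delay"
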
1 vote, 2 answers

AWS Keyspace DSBulk unload failed, "Token metadata not present"

Getting an error when trying to unload or count data from AWS Keyspaces using dsbulk. Error: Operation COUNT_20221021-192729-813222 failed: Token metadata not present. Command line: $ dsbulk count/unload -k my_best_storage -t book_awards -f…
1 vote, 1 answer

Using DSBulk to load into a CQL set returns "Invalid set literal - bind variables are not supported inside collection literals"

I'm trying to load a huge amount of data with dsbulk into a table with a set, using: dsbulk load test.json \ -h cassandra-db -u ... -p ... -k mykeyspace \ -query "update mykeyspace.mytable set value_s = value_s +{:value_s} where value_1=:value_1 and…
dank
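
That error comes from CQL itself: a bind marker cannot appear inside a collection literal such as {:value_s}. A commonly suggested workaround is to bind the entire collection instead and supply the field as a JSON array; a minimal sketch along those lines (names taken from the question, otherwise hypothetical):

    # Bind the whole set (value_s must arrive as a JSON array, e.g. ["a"]),
    # instead of placing the bind variable inside a set literal
    dsbulk load -url test.json -c json -k mykeyspace \
      -query "UPDATE mykeyspace.mytable SET value_s = value_s + :value_s WHERE value_1 = :value_1"
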
1 vote, 1 answer

How do I limit the files generated by DSBulk UNLOAD to just one CSV file?

I have run the command below on an EC2 instance to unload data from Cassandra and store it somewhere on the instance, but I observe that each dsbulk unload command generates 2 JSON files irrespective of how large or small the file size is. How do I…
Rahul Diggi
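
The number of output files an unload writes is governed by the connector's maxConcurrentFiles setting (the JSON connector has an equivalent option); a minimal sketch, assuming the CSV connector and placeholder names:

    # Force the unload to write a single output file instead of one per writer thread
    dsbulk unload -k my_keyspace -t my_table -url /home/ec2-user/export \
      --connector.csv.maxConcurrentFiles 1
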
1 vote, 2 answers

How to include a CQL SELECT statement in a dsbulk unload command

I have a huge orderhistory table in Cassandra with data going back to 2013, but I want only the last 12 months of orderhistory data to be unloaded. I use the command below, which unloads all the data starting from 2013 and stores it in the path…
Rahul Diggi
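
dsbulk unload accepts a full CQL query via -query, so the time filter can be pushed into the statement itself; a minimal sketch (keyspace, table, and column names are hypothetical, and filtering on a non-key column would also need ALLOW FILTERING):

    # Unload only the last 12 months of order history with a custom CQL query
    dsbulk unload -url /path/to/export \
      -query "SELECT * FROM my_keyspace.orderhistory WHERE order_date >= '2022-01-01' ALLOW FILTERING"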