Using DSBulk for backup/restore takes too long

Question

I use DSBulk for text-based backup and restore of a Cassandra cluster. I have written a Python script that backs up and restores all the tables in the cluster using dsbulk load/unload, but it takes a long time even for small amounts of data because a new session is created for each table (approx. 7s). In my case I have 70 tables, so 70*7s = 490s is added just for session creation. Is there a way to back up data from all tables in a cluster using a single DSBulk session? From the docs, I see DSBulk is suitable only for loading/unloading a single table at a time. Is there an alternative or another approach for this? Any suggestions are welcome.

Thanks..

score 0 · Answer 1 · answered Nov 15 '21 at 02:56

No, there isn't a way to load/unload multiple tables in a single DSBulk execution because it doesn't make sense to do so.
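Since each table needs its own DSBulk run, one way to reduce the total wall-clock time is to run several unloads concurrently so that the per-run startup cost overlaps instead of adding up. Below is a minimal Python sketch of that idea, not a tested backup script. It assumes `dsbulk` is on the PATH; the keyspace, table names, output directory, and the `runner` parameter are hypothetical and only there for illustration. `-k`, `-t`, and `-url` are the DSBulk shortcuts for keyspace, table, and output location.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_unload_command(keyspace, table, out_dir):
    """Return the argv for unloading one table into its own directory."""
    return ["dsbulk", "unload", "-k", keyspace, "-t", table,
            "-url", f"{out_dir}/{keyspace}/{table}"]


def unload_tables(keyspace, tables, out_dir, max_workers=4, runner=subprocess.run):
    """Run one dsbulk process per table, up to max_workers at a time.

    Returns a dict mapping each table name to its process return code.
    `runner` is an assumption for illustration: it defaults to
    subprocess.run and can be swapped out for a dry run.
    """
    def unload_one(table):
        result = runner(build_unload_command(keyspace, table, out_dir))
        return table, result.returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(unload_one, tables))
```

Each DSBulk run is a separate JVM process that reads from the cluster, so `max_workers` (an arbitrary default here) should be kept small enough not to overload the client machine or the nodes.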

In any case, unloading data to CSV isn't recommended as a means of backing up your cluster because there is no guarantee that the data will be consistent at a point in time.

The correct way of backing up a Cassandra cluster is using the nodetool snapshot command. For details, see Apache Cassandra Backups.
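As a rough illustration, a snapshot can be triggered from a Python script like the one in the question. This is a sketch that assumes `nodetool` is on the PATH of the node where it runs; the keyspace name, the tag, and the `runner` parameter are hypothetical. A snapshot only covers the node it is run on, so the command has to be executed on every node in the cluster.

```python
import subprocess


def build_snapshot_command(keyspace, tag):
    """Return the argv for snapshotting one keyspace under a named tag."""
    return ["nodetool", "snapshot", "-t", tag, keyspace]


def take_snapshot(keyspace, tag, runner=subprocess.run):
    """Take a snapshot on the local node.

    With the default runner, check=True raises CalledProcessError
    if nodetool exits with a non-zero status.
    """
    return runner(build_snapshot_command(keyspace, tag), check=True)


# Example call (hypothetical names):
# take_snapshot("my_keyspace", "backup_2021_11_15")
```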

If you're interested, there is an open-source tool which allows you to automate backups -- https://github.com/thelastpickle/cassandra-medusa. Cheers!
