Use dsbulk load in python

Question

I created a Cassandra database in DataStax Astra. I'm able to connect to it in Python (using the cassandra-driver module and the secure_connect_bundle). I wrote a few APIs in my Python application to query the database.

I read that I can upload a CSV file to it using dsbulk. I am able to run the following command in the terminal and it works.

dsbulk load -url data.csv -k foo_keyspace -t foo_table \
-b "secure-connect-afterpay.zip" -u username -p password -header true

Then I tried to run this same line in Python using subprocess:

import subprocess

ret = subprocess.run(
    ['dsbulk', 'load', '-url', 'data.csv', '-k', 'foo_keyspace', '-t', 'foo_table',
     '-b', 'secure-connect-afterpay.zip', '-u', 'username', '-p', 'password',
     '-header', 'true'],
    capture_output=True
)

But I got FileNotFoundError: [Errno 2] No such file or directory: 'dsbulk': 'dsbulk'. Why is dsbulk not recognized if I run it from Python?

A related question: relying on subprocess is probably not best practice. Are there better ways to upload batch data to Cassandra?

score 4 · Answer 1 · answered Aug 18 '20 at 20:47

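I think it has to do with the way PATH is handled by subprocess. When the command is passed as a list, no shell is involved, so the executable is looked up using the PATH of the Python process itself, which may differ from the PATH in your terminal. Try specifying the command as an absolute path, or a relative one like "./dsbulk" or "bin/dsbulk".

Alternatively, if you add the bin directory from the DSBulk package to the PATH environment variable that the Python process sees, it will work as you have it.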
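
A minimal sketch of the absolute-path approach, in case it helps. The helper names (`find_dsbulk`, `build_load_command`, `load_csv`) and the `~/dsbulk-1.6.0/bin` fallback directory are only illustrative assumptions; the dsbulk flags are the ones from the question.

```python
import os
import shutil
import subprocess


def find_dsbulk(extra_dirs=()):
    """Return a path to the dsbulk executable, or None if it cannot be found.

    Checks the PATH of the current Python process first, then each directory
    in extra_dirs (hypothetical install locations supplied by the caller).
    """
    found = shutil.which("dsbulk")
    if found:
        return found
    for directory in extra_dirs:
        candidate = shutil.which("dsbulk", path=os.path.expanduser(directory))
        if candidate:
            return candidate
    return None


def build_load_command(dsbulk_path, csv_path, keyspace, table,
                       bundle, username, password):
    """Build the same argument list as the terminal command in the question."""
    return [
        dsbulk_path, "load",
        "-url", csv_path,
        "-k", keyspace,
        "-t", table,
        "-b", bundle,
        "-u", username,
        "-p", password,
        "-header", "true",
    ]


def load_csv(csv_path, keyspace, table, bundle, username, password,
             extra_dirs=("~/dsbulk-1.6.0/bin",)):  # assumed install location
    """Run dsbulk load and return the CompletedProcess."""
    dsbulk_path = find_dsbulk(extra_dirs)
    if dsbulk_path is None:
        raise FileNotFoundError("dsbulk not found on PATH or in extra_dirs")
    cmd = build_load_command(dsbulk_path, csv_path, keyspace, table,
                             bundle, username, password)
    # text=True decodes stdout/stderr to str so they are easy to print or log.
    return subprocess.run(cmd, capture_output=True, text=True)
```

Calling `load_csv('data.csv', 'foo_keyspace', 'foo_table', 'secure-connect-afterpay.zip', 'username', 'password')` would then run the same load as the terminal command, and the returned object's `returncode`, `stdout`, and `stderr` show whether dsbulk succeeded.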

Adam Holmberg

I've already added it to PATH. I added this line in .zshrc `export PATH=~/dsbulk-1.6.0/bin:$PATH` – F.S. Aug 18 '20 at 21:08
Never mind! I was running the command in Jupyter Notebook. Somehow it doesn't work. The `subprocess` works when I run it as an actual python script. I appreciate your help! – F.S. Aug 18 '20 at 21:20
Hey F.S.! I am new to Cassandra and want to get an idea of how to use dsbulk with Python. Could you share some sample code or a tutorial/blog? – Manu Vats Jul 18 '22 at 04:35
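
On the Jupyter Notebook case mentioned in the comments: one plausible explanation (an assumption, not confirmed in the thread) is that the notebook kernel was not started from a shell that had read `.zshrc`, so its PATH lacked the dsbulk bin directory. A rough sketch of a workaround is to extend PATH inside the notebook before calling `subprocess`; `prepend_to_path` is a hypothetical helper name, and the directory is the one from the comment above.

```python
import os


def prepend_to_path(directory, environ=os.environ):
    """Put directory at the front of PATH in environ, unless already present."""
    directory = os.path.expanduser(directory)
    parts = [p for p in environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        environ["PATH"] = os.pathsep.join([directory] + parts)
    return environ["PATH"]
```

Running `prepend_to_path('~/dsbulk-1.6.0/bin')` in a notebook cell changes PATH only for that kernel process; a later `subprocess.run(['dsbulk', ...])` in the same kernel then searches the updated PATH.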
