
I'm attempting to import data into Cassandra on EC2 using the DSBulk loader. I have three nodes configured and communicating, as shown by `nodetool status`:

--  Address        Load        Tokens       Owns (effective)  Host ID                               Rack
UN  172.31.37.60   247.91 KiB  256          35.9%             7fdfe44d-ce42-45c5-bb6b-c3e8377b0eba  2a
UN  172.31.12.203  195.17 KiB  256          34.1%             232f7d98-9cc2-44e5-b18f-f52107a6fe2c  2c
UN  172.31.23.23   291.99 KiB  256          30.0%             b5389bf8-c0e5-42be-a296-a35b0a3e68fb  2b

I'm trying to run the following command to import a CSV file into my database:

dsbulk load -url cassReviews/reviewsCass.csv -k bnbreviews -t reviews_by_place -h '172.31.23.23' -header true

I keep receiving the following error:

Error connecting to Node(endPoint=/172.31.23.23:9042, hostId=null, hashCode=b9b80b7)

Could not reach any contact point, make sure you've provided valid addresses

I'm running the import from outside of the cluster, but within the same EC2 instance. On each node, I set listen_address and rpc_address to the node's private IP. Port 9042 is open, all three nodes are within the same region, and I'm using Ec2Snitch. Each node runs Ubuntu Server 18.04.
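To double-check that setup, a quick sanity check from the machine running DSBulk might look like the following (the cassandra.yaml path assumes a standard package install, so adjust it to your layout):

# On each node: confirm which addresses Cassandra is actually bound to
grep -E '^(listen_address|rpc_address):' /etc/cassandra/cassandra.yaml

# From the DSBulk machine: confirm the native protocol port is reachable
nc -vz 172.31.23.23 9042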

I've made sure each of my nodes is up before running the command, and that the path to my .csv file is correct. It seems like when I run the dsbulk command, the node I specify with the -h flag goes down immediately. Could there be something wrong with my configuration that I'm missing? DSBulk worked well locally, but is there a better method for importing data from CSV files on an EC2 instance? Thank you!

EDIT: I've been able to load data in chunks using the DSBulk loader, but the process is occasionally interrupted by this error:

[s0|/xxx.xx.xx.xxx:9042] Error while opening new channel

My current interpretation is that the node at the specified IP has run out of storage space and crashed, causing any subsequent DSBulk operations to fail. The work-around so far has been to clear excess log files from /var/log/cassandra and restart the node, but I think a better approach would be to increase the SSD capacity on each instance.
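The cleanup I've been doing looks roughly like this (the paths and service name assume a package install of Cassandra on Ubuntu, so adjust them to your layout):

# Check free space and what is consuming it
df -h
sudo du -sh /var/lib/cassandra /var/log/cassandra

# Reclaim space from rotated logs and old snapshots
sudo rm /var/log/cassandra/*.log.*   # keeps the active system.log
nodetool clearsnapshot

# Bring the crashed node back up
sudo systemctl restart cassandra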

  • are you running the import from inside the cluster, or outside? If outside, is it a machine in EC2 as well? How are the nodes configured regarding listen_address/broadcast_address? Is port 9042 open? Check with `nc -vz IP 9042` from the machine where DSBulk is running – Alex Ott Jun 03 '20 at 06:41
  • Hey Alex, thanks for the quick reply. I'm running the import from outside of the cluster, but within the same EC2 instance. On each node, I set the listen_address and rpc_address to its private IP. Port 9042 is open - I'm running three nodes, all within the same region, using Ec2Snitch. I haven't configured my broadcast_address, since the nodes should be able to contact each other via their private IPs within the same region, I believe. – tpooch21 Jun 03 '20 at 18:47
  • Does `nc -vz` show that the connection is successful? – Alex Ott Jun 03 '20 at 19:00
  • Yes, connection is successful – tpooch21 Jun 04 '20 at 01:50
  • Forgot to ask first - what version of DSBulk and Cassandra? – Alex Ott Jun 04 '20 at 05:27
  • Have you tried using an address translator? For DSBulk 1.4+: --driver.advanced.address-translator.class Ec2MultiRegionAddressTranslator and for DSBulk <1.4: --driver.addressTranslator EC2MultiRegionAddressTranslator – adutra Jun 04 '20 at 08:19
  • Trevor, you can improve your question by adding the relevant info from your reply to @AlexOtt. – Uberhumus Jun 04 '20 at 09:35
  • Thanks Uberhumus, made those edits. @Alex, I'm using DSBulk v1.5.0 and Cassandra v3.11.6. adutra, I have not tried using an address translator, but will look into it. As mentioned in my edit, I think the problem stems from insufficient SSD capacity. – tpooch21 Jun 04 '20 at 17:42
  • The error message shows that it can't make the initial connection to the cluster – Alex Ott Jun 04 '20 at 18:08
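Applying adutra's address-translator suggestion from the comments to the command in the question would look like the following for DSBulk 1.4+ (untested here; everything except the last option is unchanged from the question):

dsbulk load -url cassReviews/reviewsCass.csv -k bnbreviews -t reviews_by_place -h '172.31.23.23' -header true --driver.advanced.address-translator.class Ec2MultiRegionAddressTranslator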

1 Answer


As mentioned in my edit, the problem was solved by increasing the volume size on each of my node instances. DSBulk was failing and the nodes were crashing because the EC2 instances were running out of storage, from a combination of imported data, logging, and snapshots. I ended up running my primary node, on which I was running the DSBulk command, on a t2.medium instance with a 30 GB SSD, which solved the issue.
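For anyone hitting the same wall, growing an EBS-backed volume in place looks roughly like this (the volume ID is a placeholder and the device/partition names are assumptions for a typical Ubuntu 18.04 instance - check yours with lsblk first):

# Grow the EBS volume from the AWS CLI
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 30

# On the instance: extend the partition, then the ext4 filesystem
lsblk
sudo growpart /dev/xvda 1
sudo resize2fs /dev/xvda1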
