
I am trying to load CSV files into a Cassandra cluster using the dsbulk utility. I have a local copy of the CSV file and am trying to connect to a remote cluster and load the CSV into a table. However, dsbulk fails to reach the remote cluster addresses, reporting

Could not reach any contact point, make sure you've provided valid addresses

and

Caused by: An existing connection was forcibly closed by the remote host.

I am using the same connection parameters from IntelliJ to connect to the cluster with SSL enabled and it works fine, so I can't figure out why it is not working with dsbulk. Below are the application.conf for dsbulk and the commands that I am trying to run.

dsbulk {
  --dsbulk.connector.name = csv
  --dsbulk.connector.csv.url = <CSV_Path>
  --dsbulk.connector.csv.header true
  --datastax-java-driver.basic.contact-points = [ "169.XX.XXX.XX", "169.XX.XXX.XX", "169.XX.XXX.XX" ]
  --datastax-java-driver.advanced.auth-provider.username = <user_name>
  --datastax-java-driver.advanced.auth-provider.password = <pwd>
  --dsbulk.schema.keyspace = <keyspace>
  --dsbulk.schema.table = <table>
  --datastax-java-driver.advanced.ssl-engine-factory.truststore-path = <cacerts path>
  --datastax-java-driver.advanced.ssl-engine-factory.truststore-password = <pwd>
  --datastax-java-driver.advanced.resolve-contact-points = true
}

commands :

$ dsbulk load -url <CSV path>

The above command doesn't pick up the application.conf properties and tries to connect to 127.0.0.1.

Error :

[driver] Error connecting to Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=2c61adb4)

I'm not really sure why the conf file is not being picked up by dsbulk.

$ dsbulk load -url <CSV path> -k keyspace -t table -h "[ "169.XX.XXX.XX", "169.XX.XXX.XX", "169.XX.XXX.XX" ]" -u userName -p pwd

The above command fails to connect to the cluster nodes that I added explicitly. Error:

[driver] Error connecting to Node(endPoint=/169.XX.XXX.XX:9042, hostId=null, hashCode=2a38b2fe),
Suppressed: [driver|control|id: 0x17d0139b, L:/172.31.50.184:59702 - R:/169.XX.XXX.XXX:9042] Protocol initialization request, step 1 (OPTIONS): unexpected failure (com.datastax.oss.driver.api.core.connection.ClosedConnectionException: Unexpected error on channel).
     Caused by: Unexpected error on channel.
       Caused by: An existing connection was forcibly closed by the remote host.

dsbulk retries all of the nodes and gives the same error.
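In case the quoting matters, this is the shape I believe the host list is meant to take on the command line (a sketch with placeholder values; I'm assuming -h accepts a JSON-style list wrapped in single quotes so the shell passes it through as a single argument):

$ dsbulk load -url /path/to/data.csv -k my_keyspace -t my_table \
    -h '["169.XX.XXX.XX", "169.XX.XXX.XX", "169.XX.XXX.XX"]' \
    -u userName -p pwd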

Authentication falls back to plain text, which I believe is fine for my use case:

Username and password provided but auth provider not specified, inferring PlainTextAuthProvider
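From what I understand, the provider class can also be set explicitly in the driver config so nothing has to be inferred; this is only a sketch based on the Java driver 4.x configuration layout, with placeholder credentials:

datastax-java-driver {
  advanced {
    auth-provider {
      class = PlainTextAuthProvider
      username = "user_name"
      password = "pwd"
    }
  }
}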

Could you please suggest what the problem is with my config or my connection to the remote cluster?

My actual use case is to archive millions of records from Sybase to Cassandra every week, for which I am trying to create a simple Java utility that executes dsbulk. Any other approach is also appreciated.
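Roughly, the weekly job I have in mind looks like the sketch below: export the batch from Sybase to CSV and then run dsbulk on the result. Everything here is a placeholder (the bcp flags, paths, keyspace and table names), not something I have working yet:

#!/bin/sh
# weekly_archive.sh - rough sketch; all paths, names and credentials are placeholders
set -e

# 1. Export the weekly batch from Sybase to a comma-separated file (bcp usage is illustrative only)
bcp mydb..archive_rows out /tmp/archive_rows.csv -c -t, -U sybase_user -P sybase_pwd -S SYBASE_SERVER

# 2. Load the exported CSV into Cassandra with dsbulk
dsbulk load -url /tmp/archive_rows.csv -k my_keyspace -t my_table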

Many thanks in advance.


2 Answers


Your application.conf file has incorrect contents. See this documentation on how to construct the config files.

Madhavan
  • Hi Madhavan, thanks for replying. I have added the driver-related config in driver.conf and the dsbulk-related config in application.conf. I still see the `Could not reach any contact point, make sure you've provided valid addresses` error.
    datastax-java-driver.basic.contact-points = [ "169.XX.XXX.XX:9042"]
    datastax-java-driver.advanced.auth-provider.username = UN
    datastax-java-driver.advanced.auth-provider.password = PWD
    datastax-java-driver.advanced.ssl-engine-factory.truststore-path = PATH
    I changed app.conf the same way. Do you think I am still missing something?
    – harish bollina Mar 01 '23 at 05:08
  • I was able to fix the config: adding the SSL config and explicitly setting the SSL engine factory in driver.conf worked for me. I now see dsbulk loading around 500 records per second; any idea if we can increase that speed? – harish bollina Mar 01 '23 at 10:40
  • Glad you were able to fix the configuration problem! If you're doing a single-threaded operation like loading from a single file (i.e. `-url /path/to/a/single_file.csv`) then there is not much we can do here to improve the throughput. One thing to try would be to allocate more memory to the DSBulk process itself via `export DSBULK_JAVA_OPTS="-Xmx10G"` prior to running your load command, as sketched below. See if that works in your environment. – Madhavan Mar 01 '23 at 12:59
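A sketch of that suggestion, assuming a bash-compatible shell (the heap size and paths are illustrative):

# Give the DSBulk JVM a larger heap before starting the load
export DSBULK_JAVA_OPTS="-Xmx10G"

# Then run the load as usual
dsbulk load -url /path/to/archive_rows.csv -k my_keyspace -t my_table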

The problem is that you have not formatted the entries in the configuration file correctly so DSBulk cannot parse them. Since the configuration file is not usable, DSBulk defaults to connecting to localhost (127.0.0.1).

The correct format looks like this:

dsbulk {
   connector.name = csv
   schema.keyspace = "keyspacename"
   schema.table = "tablename"
}

Then you need to define the Java driver options separately, which looks like this:

datastax-java-driver {
  basic {
    contact-points = [ "cp1", "cp2", "cp3"]
  }
  advanced {
    ssl-engine-factory {
      keystore-password = "keystorepass"
      keystore-path = "/path/to/keystore.file"
      class = DefaultSslEngineFactory
      truststore-password = "truststorepass"
      truststore-path = "/path/to/truststore.file"
    }
  }
}

If you don't configure SSL correctly then the driver will not be able to connect to any of the nodes, which is the reason for the errors you mentioned.

Note that you can place the Java driver configuration in a separate driver.conf file but you need to make sure you reference it in the application configuration with the line:

include classpath("/path/to/driver.conf")
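As an illustration of running a load against that configuration (the -f flag and paths here are placeholders; adjust them to wherever your application.conf actually lives):

$ dsbulk load -f /path/to/application.conf -url /path/to/data.csv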

For details, see Using SSL with DSBulk. Cheers!

Erick Ramirez
  • Hello Erick, thanks for coming back. I did figure that out and was able to run it now, but I see dsbulk loading about 500 records per second. I'm trying to optimize that by enabling BATCH; do you have any suggestions? For 2 million records, dsbulk took 1 hr to load into Cassandra with a batch count of 1 – harish bollina Mar 01 '23 at 12:12
  • Please post a new question. If we keep responding to follow up questions in the comments section, it will be a never-ending thread. Cheers! – Erick Ramirez Mar 01 '23 at 12:14