
I am running dsbulk to load a CSV into Cassandra. I tried with a CSV that has 2 million records, and dsbulk took almost 1 hour and 6 minutes to load the file into the DB.

    total | failed | rows/s |  p50ms |  p99ms | p999ms | batches
2,000,000 |      0 |    500 | 255.65 | 387.97 | 754.97 |    1.00

This is what I see from the console output. I am trying to increase the number of batches and the rows/sec. I have added `maxConcurrentQueries` and `bufferSize`, but dsbulk still starts with a single batch and about 500 rows/sec.

How can I improve the load performance for dsbulk?

James Z
  • What is the version of your source and target Cassandra clusters? What are the hardware specs of the machine where DSBulk is installed? Also, what is the output of running the `./dsbulk --version` command? – Madhavan Mar 03 '23 at 20:53

2 Answers


You need to post the full dsbulk load command that you used, the source & target C* versions, and your hardware specs to triage this efficiently.

Please see the comments on this answer.

If you're doing a single-threaded operation like loading from a single file (i.e. `-url /path/to/a/single_file.csv`), then there is not much we can do here to improve the throughput. One thing to try would be to allocate more memory to the DSBulk process itself via `export DSBULK_JAVA_OPTS="-Xmx10G"` prior to running your load command. See if that works in your environment and make sure your target cluster is able to handle the increased load.

Madhavan
  • Hi Madhavan, my dsbulk command has basic parameters for connection and SSL. I am running it from a local machine with Java `-Xmx` set to 4G, and the destination is a remote cluster. Not sure of the C* version. I am trying to upload from a single CSV file. Just to understand, isn't there any batching support for a single-file load? I am trying `batchSize`, concurrent requests and other options but don't really see a difference. Do you think there is a different way to do it where I can achieve increased throughput? – harish bollina Mar 02 '23 at 03:29
  • When you perform an `unload` operation, just specify a directory with `-url` so DSBulk automatically splits the output into multiple files, which you can then point to during the `load` operation. Regarding batching, yes, it has [default batching based on partition keys](https://docs.datastax.com/en/dsbulk/docs/reference/batchOptions.html#_batch_mode_dsbulk_batch_mode_string). Based on your local machine's specs and the target cluster's sizing, you could push more throughput with [this parameter](https://docs.datastax.com/en/dsbulk/docs/reference/engineOptions.html#dsbulkEngineOptionsMaxConcurrentQueries). – Madhavan Mar 03 '23 at 12:16

I have tried using batching and other concurrency parameters with dsbulk but couldn't see any improvement. I then tried the DataStax `Cluster` and `Session` API to create a session and used that session to execute batch statements.

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    // DataStax Java driver 3.x
    Cluster cluster = Cluster.builder()
            .addContactPoints("0.0.0.0", "0.0.0.0")
            .withCredentials("userName", "pwd")
            .withSSL()
            .build();
    Session session = cluster.connect("keySpace");

    BatchStatement batchStatement = new BatchStatement();
    batchStatement.add(new SimpleStatement("String query with JSON Data"));
    session.execute(batchStatement);

I used an `ExecutorService` with 10 threads, with each thread inserting 1,000 queries per batch.

I tried something like the above and it worked fine for my use case; I was able to insert 2 million records in 15 minutes. I build the insert queries with the CQL `JSON` keyword, creating the JSON from the result set. You can also use `executeAsync`, in which case your application thread finishes in a minute or two, but the Cassandra cluster still took the same 15 minutes to add all the records.
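
For reference, here is a minimal sketch of that pattern. It assumes the driver 3.x `Session` from the snippet above and a pre-built list of `INSERT ... JSON` statements; the pool size of 10 and chunk size of 1,000 are the numbers mentioned above, and the class and method names are just placeholders.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class ParallelBatchLoader {

        // Splits the pre-built INSERT statements into chunks of 1,000 and
        // executes each chunk as one batch on a fixed pool of 10 threads.
        public static void load(Session session, List<String> insertQueries) throws InterruptedException {
            final int batchSize = 1000;
            ExecutorService pool = Executors.newFixedThreadPool(10);

            for (int i = 0; i < insertQueries.size(); i += batchSize) {
                final List<String> chunk =
                        insertQueries.subList(i, Math.min(i + batchSize, insertQueries.size()));
                pool.submit(() -> {
                    BatchStatement batch = new BatchStatement();
                    for (String cql : chunk) {
                        batch.add(new SimpleStatement(cql));
                    }
                    // session.executeAsync(batch) returns sooner, but the cluster
                    // still needs roughly the same time to apply all the writes.
                    session.execute(batch);
                });
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS); // wait for all batches to finish
        }
    }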

To read data from the source Sybase DB, I used `jdbcTemplate.queryForList`, which returns the records as a `List<Map<String, Object>>`; each map in that list can be converted to JSON with Jackson `ObjectMapper`'s `writeValueAsString` method.
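
A rough sketch of that read-and-convert step, assuming a Spring `JdbcTemplate` pointed at the Sybase source and Jackson's `ObjectMapper` (the query, keyspace, and table names are placeholders):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class SourceRowReader {

        // Reads rows from the source table and turns each row map into an
        // INSERT ... JSON statement for Cassandra.
        public static List<String> buildInsertQueries(JdbcTemplate jdbcTemplate) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            List<String> inserts = new ArrayList<>();

            // Each record comes back as a Map of column name -> value.
            List<Map<String, Object>> rows =
                    jdbcTemplate.queryForList("SELECT * FROM source_table"); // placeholder query

            for (Map<String, Object> row : rows) {
                String json = mapper.writeValueAsString(row);
                // Escape single quotes so the JSON is a valid CQL string literal.
                inserts.add("INSERT INTO my_keyspace.my_table JSON '" + json.replace("'", "''") + "'");
            }
            return inserts;
        }
    }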

Hope this will be useful to someone.