
I have run the command below in an EC2 instance to unload data from Cassandra and store it on the EC2 host. However, I am observing that each dsbulk unload command generates 2 JSON files, irrespective of how large or small the data is.

How do I control how many files are generated? For example, suppose I want a particular dsbulk unload to generate 5 part files instead of 2?

dsbulk unload -k custdata -t orderhistory -h '172.xx.xx.xxx' -c json -url proddata/json/custdata/orderhistory/data

1 Answer


The default behaviour of the DataStax Bulk Loader is to parallelise the work across multiple threads if the machine has multiple cores, which is why the unload produces more than one output file.

To limit the output to a single file, set the file concurrency to 1 with:

$ dsbulk -maxConcurrentFiles 1 ...

Just be aware that this will limit the throughput of DSBulk since it will be single-threaded.

For details, see DSBulk Connector options. Cheers!

[UPDATED] Use a single dash (-) with -maxConcurrentFiles, as advised by Alex Dutra (DSBulk dev).
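
For example, applied to the command from the question (host and URL copied verbatim from there), the full invocation would look something like this:

$ dsbulk unload -k custdata -t orderhistory -h '172.xx.xx.xxx' -c json -url proddata/json/custdata/orderhistory/data -maxConcurrentFiles 1

By the same logic, -maxConcurrentFiles 5 should cap the unload at 5 part files, matching the example in the question.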

  • Not working. It's still generating 2 files after setting concurrency to 1: dsbulk unload --maxConcurrentFiles 1 -k custdata -t orderhistory -h '172.xx.xx.xxx' -c json -url proddata/json/custdata/orderhistory/data – Rahul Diggi Jul 01 '22 at 04:40
  • 1
    Are you sure it's generating 2 output files? Perhaps check the timestamps to make sure one of them wasn't generated from a previous run. Cheers! – Erick Ramirez Jul 01 '22 at 04:57
  • Yes, it's generating two files. I checked the timestamps as well – Rahul Diggi Jul 01 '22 at 07:09
  • There is an error in the option: maxConcurrentFiles is a shortcut option and as such it should be introduced by a single dash: -maxConcurrentFiles 1 (see the example after these comments) – adutra Jul 01 '22 at 07:12
  • I opened https://github.com/datastax/dsbulk/issues/433. – adutra Jul 01 '22 at 11:37
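
To illustrate the distinction adutra describes: shortcut options take a single dash, while fully-qualified setting paths take a double dash. A sketch of the two forms, which should be equivalent for the JSON connector used in the question:

$ dsbulk unload -maxConcurrentFiles 1 ...
$ dsbulk unload --connector.json.maxConcurrentFiles 1 ...

The first uses the shortcut; the second spells out the full setting path, which is why the double dash is correct there but not in the shortcut form.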