
I have run the command below in an EC2 instance to unload data from Cassandra and store it on the EC2 host. However, I am observing that each dsbulk unload command generates 2 JSON files, irrespective of how large or small the data is.

How do I control how many files are generated? For example, suppose I want a particular dsbulk unload to generate 5 part files instead of 2?

dsbulk unload -k custdata -t orderhistory -h '172.xx.xx.xxx' -c json -url proddata/json/custdata/orderhistory/data

1 Answer


The default behaviour of the DataStax Bulk Loader is to parallelise the work across multiple threads if the machine has multiple cores, which is why the unload produces more than one output file.

To limit the output to a single file, set the file concurrency to 1 with:

$ dsbulk -maxConcurrentFiles 1 ...

Just be aware that this will limit the throughput of DSBulk since it will be single-threaded.

For details, see DSBulk Connector options. Cheers!

[UPDATED] Use a single dash (-) with -maxConcurrentFiles, as advised by Alex Dutra (DSBulk dev).
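
For example, applied to the command from the question (host and URL copied verbatim from there), the full invocation would look something like this:

$ dsbulk unload -k custdata -t orderhistory -h '172.xx.xx.xxx' -c json -url proddata/json/custdata/orderhistory/data -maxConcurrentFiles 1

By the same logic, -maxConcurrentFiles 5 should cap the unload at 5 part files, matching the example in the question.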

  • Not working. It's still generating 2 files after setting concurrency to 1: dsbulk unload --maxConcurrentFiles 1 -k custdata -t orderhistory -h '172.xx.xx.xxx' -c json -url proddata/json/custdata/orderhistory/data – Rahul Diggi Jul 01 '22 at 04:40
  • 1
    Are you sure it's generating 2 output files? Perhaps check the timestamps to make sure one of them wasn't generated from a previous run. Cheers! – Erick Ramirez Jul 01 '22 at 04:57
  • Yes, it's generating two files. I checked the timestamps as well – Rahul Diggi Jul 01 '22 at 07:09
  • There is an error in the option: maxConcurrentFiles is a shortcut option and as such it should be introduced by a single dash: -maxConcurrentFiles 1 (see the example after these comments) – adutra Jul 01 '22 at 07:12
  • I opened https://github.com/datastax/dsbulk/issues/433. – adutra Jul 01 '22 at 11:37
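
To illustrate the distinction adutra describes: shortcut options take a single dash, while fully-qualified setting paths take a double dash. A sketch of the two forms, which should be equivalent for the JSON connector used in the question:

$ dsbulk unload -maxConcurrentFiles 1 ...
$ dsbulk unload --connector.json.maxConcurrentFiles 1 ...

The first uses the shortcut; the second spells out the full setting path, which is why the double dash is correct there but not in the shortcut form.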