I am using mlcp v9.0.4 to load data into MarkLogic v9.0.9 and I am trying to figure out the following:
1. If a CSV file has no data rows and contains only the column names, the file never gets loaded. How can I overcome this and load such empty files?
2. mlcp behaves differently when input_file_path is a directory containing CSVs versus a directory containing another directory.
E.g., if the structure is /dir/dir1/*.csv, then input_file_path=/dir/dir1/ loads faster than input_file_path=/dir/ (with the other options left at their defaults).
What logic does mlcp apply during the load here? Should I change any options so that both ways give me the same result?
For point 1:
I could add an empty row to the CSV and load it, but I would rather avoid that approach.
I tried using a transform module, but that slows down the load.
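To make the case in point 1 concrete, a header-only file can be detected before the load with a short script. This is only a sketch in Python, not something mlcp provides: the function name `is_header_only` is hypothetical, and it assumes a comma-delimited file whose first row holds the column names.

```python
import csv


def is_header_only(path):
    """Return True if the CSV at `path` has a header row but no data rows.

    Hypothetical helper for illustration; not part of mlcp.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)  # first row: the column names
        if header is None:
            return False  # completely empty file, not even a header
        # csv.reader yields [] for blank lines, so any non-empty row is data.
        return not any(row for row in reader)
```

Running this over the input directory would at least show which files are affected, without changing the files themselves.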
For point 2: I have tried different combinations of the mlcp options batch_size, split_size, max_split_size, thread_count, and thread_count_per_split, as described in the MarkLogic docs. However, I wonder if I am just beating around the bush. I want to understand how mlcp treats the inputs under the hood.
These are the details of what I tried for point 2, on a server with 128 GB RAM:
File/directory structure:
/dir/dir1/1.csv - 4 MB
/dir/dir1/2.csv - 10 MB
/dir/dir1/3.csv - 400 MB
/dir/dir1/4.csv - 3000 MB
Database configuration:
forest policy - bucket
locking - off
journaling - fast
Options file for mlcp:
-generate_uri
true
-fast_load
true
-thread_count
32
-split_size
true
-max_split_size
94371840
-thread_count_per_split
1
-batch_size
100
-transaction_size
20
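For reference, a max_split_size of 94371840 bytes is 90 MB. As a rough check of how much parallelism these files could offer, here is a back-of-the-envelope sketch in Python. It assumes that a file is cut into ceil(size / max_split_size) splits when input splitting is in effect. That is an assumption for illustration, not confirmed mlcp behaviour. The file sizes are the ones listed above, taking 1 MB as 1024 * 1024 bytes, and the function name `estimated_splits` is hypothetical.

```python
import math

MAX_SPLIT_SIZE = 94371840  # bytes, from the options file (90 * 1024 * 1024)
MB = 1024 * 1024

# File sizes in MB, taken from the directory listing in the question.
FILE_SIZES_MB = {"1.csv": 4, "2.csv": 10, "3.csv": 400, "4.csv": 3000}


def estimated_splits(size_bytes, max_split=MAX_SPLIT_SIZE):
    """Estimate the split count, ASSUMING ceil(size / max_split) per file."""
    return max(1, math.ceil(size_bytes / max_split))


if __name__ == "__main__":
    total = 0
    for name, size_mb in FILE_SIZES_MB.items():
        n = estimated_splits(size_mb * MB)
        total += n
        print(f"{name}: {n} split(s)")
    print(f"total: {total} split(s)")
```

Under that assumption, almost all of the splits would come from the 3000 MB file, so the two small files would contribute very little parallel work regardless of thread_count.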