
I am using mlcp v9.0.4 to load data into MarkLogic v9.0.9 and I am trying to figure out the following:

  1. If a CSV file has no data rows and contains only the column names, the file never gets loaded. How can I work around this and load the empty files?

  2. mlcp behaves differently when input_file_path is a directory containing CSVs versus when input_file_path is a directory containing another directory.

E.g., if the structure is /dir/dir1/*.csv, then input_file_path=/dir/dir1/ loads faster than input_file_path=/dir/ (with all other options left at their defaults).

What logic does mlcp apply to do the load here? Should I change any options so that both ways give me the same result?

For point 1:

  1. I could add an empty row to the csv and load it, but I wouldn't want this approach.

  2. I tried using a transform module, but that slows down the load.

For point 2: I have been trying different combinations of the mlcp options batch_size, split_size, max_split_size, thread_count, and thread_count_per_split, as described in the MarkLogic docs. However, I wonder if I am just beating around the bush. I want to understand how mlcp treats the inputs under the hood.

For point 2, on a server with 128 GB of RAM, these are the details I tried:

File/directory structure:

/dir/dir1/1.csv - 4 MB
/dir/dir1/2.csv - 10 MB
/dir/dir1/3.csv - 400 MB
/dir/dir1/4.csv - 3000 MB

Database configuration:

forest policy - bucket
locking - off
journaling - fast

options file for mlcp:

-generate_uri
true
-fast_load
true
-thread_count
32
-split_size
true
-max_split_size
94371840
-thread_count_per_split
1
-batch_size
100
-transaction_size
20
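
As a rough way to reason about how the four files above might be divided among the 32 threads, here is a minimal sketch. It rests on an assumption that is not confirmed anywhere in this thread: that with input splitting enabled, each file is cut into chunks of at most max_split_size bytes. The real splitter may also take block size, a minimum split size, and other settings into account, so actual counts can differ. The function name `estimate_splits` is hypothetical, and "MB" is taken to mean 1024 * 1024 bytes.

```python
import math

MIB = 1024 * 1024  # assumption: the "MB" sizes listed above are mebibytes


def estimate_splits(file_size_bytes: int, max_split_size: int) -> int:
    """Estimate the chunk count for one file.

    Assumes every chunk holds at most max_split_size bytes and that a
    file always produces at least one chunk. This is a simplification,
    not mlcp's actual splitting algorithm.
    """
    return max(1, math.ceil(file_size_bytes / max_split_size))


MAX_SPLIT_SIZE = 94371840  # value from the options file above (90 MiB)

# File sizes from the directory listing above.
FILE_SIZES = {
    "1.csv": 4 * MIB,
    "2.csv": 10 * MIB,
    "3.csv": 400 * MIB,
    "4.csv": 3000 * MIB,
}

estimates = {
    name: estimate_splits(size, MAX_SPLIT_SIZE)
    for name, size in FILE_SIZES.items()
}

if __name__ == "__main__":
    for name, count in estimates.items():
        print(f"{name}: about {count} chunk(s)")
    print("total:", sum(estimates.values()))
```

Under these assumptions the estimate comes to 1, 1, 5, and 34 chunks, 41 in total, so most of the work sits in 4.csv regardless of which directory level is passed as input_file_path.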
Bharadwaj

1 Answer

For point 1) What would you expect as the result of loading a file with no data rows? Considering that the data model is such that 1 CSV 'row' == 1 ML document, 0 CSV data 'rows' == ??? documents? Are you expecting a number != 0?
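
To make the row-to-document mapping concrete, here is a minimal sketch that counts the data rows in a CSV, which under the 1 row == 1 document model is the number of documents a load would be expected to produce. It is not part of mlcp. The function name `count_data_rows` is hypothetical, and it assumes the first line is the header and that blank lines do not count as rows.

```python
import csv
import io


def count_data_rows(csv_text: str) -> int:
    """Return the number of non-blank rows that follow the header line."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)  # first row is assumed to be the header
    if header is None:
        return 0  # completely empty input: no header, no rows
    # Skip rows that are empty or contain only whitespace cells.
    return sum(1 for row in reader if any(cell.strip() for cell in row))


if __name__ == "__main__":
    header_only = "id,name\n"
    with_rows = "id,name\n1,alpha\n2,beta\n"
    print(count_data_rows(header_only))
    print(count_data_rows(with_rows))
```

A header-only file yields a count of zero, so a check like this run before the load could tell "the file was empty" apart from "the load failed" without going back to the logs.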

For point 2) Could you share the performance difference you are seeing? What does "loads faster" mean, and what does the final result set look like?

DALDEI
  • Point 1) I would think of like or . I think that finding out whether the data was empty or whether there was a problem in loading should not make me go back to my logs every time. Point 2) I could see a significant change in performance: 3 times faster. I will do another run and post the results here. However, my intention is to understand why there is such a difference and whether it is expected. – Bharadwaj Sep 10 '19 at 00:31