Loading CSV (or TSV) into MarkLogic with automatic encoding

Question

I have successfully loaded a very clean CSV file (plain English, no fancy symbols or images) into MarkLogic using MLCP (MarkLogic Content Pump), with the first row taken as the column names. However, when I try to load a file that is not clean (i.e. one that mixes other languages and encodings), the load fails.

I read in the Ingestion guide (http://docs.marklogic.com/guide/ingestion/encoding?print=yes) that encoding is not controllable with MLCP, so I decided to give the Java API and xdmp in XQuery a try.

When using the Java API, I get: Invalid UTF-8 escape sequence at line 1549 -- document is not UTF-8 encoded

If I load it with xdmp using automatic encoding, either in Query Console or in a flow in Information Studio, it loads without a problem, but MarkLogic does not take the first row as column names. Instead, it takes in the entire file as one document, which is not what I am looking for.

Is there a way to load the CSV file without the encoding problem and have it take in the first row as column names?

Thanks in advance.

Have you tried opening the file in an editor first, and forcing it to save as UTF-8? — wst, Apr 28 '14 at 21:06
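If an editor is not practical (for example, the file is large), the same re-save can be scripted. This is only a minimal sketch: the function name `transcode_to_utf8` is made up for illustration, and `cp1252` is an assumed source encoding, so substitute whatever encoding the file really uses. A file that genuinely mixes several encodings would need more care than this.

```python
def transcode_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Re-encode a text file as UTF-8.

    src_encoding is an ASSUMPTION about the input; a wrong guess will
    either raise UnicodeDecodeError or silently produce wrong characters.
    """
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

The resulting file should then be valid UTF-8, which is what the "document is not UTF-8 encoded" error in the question is complaining about.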

mblakele · Answer 1 · 2014-04-29T13:58:37.997

RecordLoader can do that: http://marklogic.github.io/recordloader/

CONFIGURATION_CLASSNAME=com.marklogic.recordloader.xcc.DelimitedDataConfiguration
FIELD_DELIMITER=,
RECORD_NAME=my-root-element-name

Run recordloader.sh with those properties and your CSV file(s). RecordLoader will expect the first line to be a list of headers, and will turn those into element names. Adjust my-root-element-name to suit yourself, and set INPUT_ENCODING to whatever encoding you need.

See https://github.com/marklogic/recordloader/blob/master/src/java/com/marklogic/recordloader/xcc/DelimitedDataConfiguration.java for more configuration options.
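To make the "first line becomes element names" idea concrete, here is a rough Python sketch of the general technique. It is not RecordLoader's code and does not talk to MarkLogic; the function name `rows_to_xml` and the field-count check are illustrative assumptions. It also assumes every header is already a valid XML element name.

```python
import csv
import xml.etree.ElementTree as ET


def rows_to_xml(stream, record_name="my-root-element-name", delimiter=","):
    """Yield one XML string per data row, using the first row as element names.

    `stream` is an open text stream, so the caller chooses the input
    encoding when opening the file.
    """
    reader = csv.reader(stream, delimiter=delimiter)
    headers = next(reader)  # first line: column names
    for row in reader:
        # Illustrative check: refuse rows whose field count differs from
        # the header count, instead of guessing at the alignment.
        if len(row) != len(headers):
            raise ValueError(
                f"line {reader.line_num}: fields={len(row)}, labels={len(headers)}"
            )
        record = ET.Element(record_name)
        for name, value in zip(headers, row):
            ET.SubElement(record, name).text = value
        yield ET.tostring(record, encoding="unicode")
```

A caller would open the file with an explicit encoding, for example `open(path, encoding="cp1252", newline="")` (again, the encoding is an assumption), and pass the stream in. Each yielded string is one record document, rather than the whole file as a single document.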

edited Apr 29 '14 at 13:58

answered Apr 29 '14 at 03:32

mblakele

`SEVERE: com.marklogic.recordloader.FatalException: com.marklogic.recordloader.LoaderException: document mismatch: fields=2, labels=19 at stdin:2:` How do I fix this? Am I supposed to be using the Java API or the shell script? – user3521239 May 01 '14 at 20:21
Support via StackOverflow sounds like a bad idea: please use https://github.com/marklogic/recordloader/issues/new where you have plenty of room to tell me all the gory details. – mblakele May 01 '14 at 21:36
