MarkLogic Content Pump, content_encoding, encoding="US-ASCII"?

Question

MarkLogic is installed on a Windows 10 machine.

We are using MarkLogic Content Pump (MLCP) to import data.

It works well with documents that declare:

<?xml version="1.0" encoding="UTF-8"?>

It shows an error when importing documents that declare a non-UTF-8 encoding, i.e.

<?xml version="1.0" encoding="US-ASCII"?>

I looked at the MLCP guide and found the content_encoding parameter, but it is not working: it throws an error for records that contain special characters such as ´, δ, “ and so on.

ERROR mapreduce.ContentWriter: XDMP-DOCENTITYREF: Invalid entity reference "gamma"

I am passing it as follows:

mlcp.bat -content_encoding "US-ASCII"

When I looked at this document, it says "Only UTF-8 is supported."

When I looked at this, it says "The option value must be a character set name accepted by your JVM;"

So I am confused and not sure how to solve this issue or how to set the character set in the JVM.
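
For readers who want to reproduce the symptom outside MarkLogic: the error message names an entity (`gamma`), which suggests the parser is rejecting an undeclared named entity rather than a byte it cannot decode. The sketch below uses Python's standard XML parser as a stand-in (an assumption; it is not the parser MLCP uses) to show that a named entity with no DTD declaration fails, while the equivalent numeric character reference parses. The `try_parse` helper and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def try_parse(xml_bytes):
    """Return the root element's text, or None if the document is not well-formed."""
    try:
        return ET.fromstring(xml_bytes).text
    except ET.ParseError:
        return None


HEADER = b'<?xml version="1.0" encoding="US-ASCII"?>'

# Named entity with no DTD declaring it: a conforming parser rejects this.
named = HEADER + b"<doc>&gamma;</doc>"

# Numeric character reference for the same character: always allowed in XML.
numeric = HEADER + b"<doc>&#947;</doc>"

if __name__ == "__main__":
    print("named entity parsed:", try_parse(named) is not None)
    print("numeric reference parsed:", try_parse(numeric) is not None)
```

Note that both sample documents are pure ASCII bytes, so the declared `US-ASCII` encoding is not what makes the first one fail.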

If your XML document contains characters such as `δ`, then your encoding is not US-ASCII. If you declare that it is ASCII (a single-byte character set) and then include content such as `δ`, which is a multi-byte character, the XML parser will read each byte as a separate character, and you will get mangled garbage and the potential for these sorts of errors. — Mads Hansen, Mar 04 '19 at 19:08
Thanks MH for your reply. I am working on a legacy system that contains hundreds of thousands of files where the character set is declared as US-ASCII, and I cannot change the source. One option is to transform the content during ingestion, but I want to avoid transforming and would prefer an easier solution available within MLCP. — Manish Joisar, Mar 04 '19 at 19:23
It sounds like the encoding in the XML header is unreliable. In that case it will be hard to get it right. You could give `windows-1252` a go, which is a very typical encoding on Windows. If that fails as well, you can have MarkLogic take a guess by enabling the `-xml_repair_level full` option. I would advise checking the results carefully though; you could end up with garbled characters, particularly for diacritics and special characters like the ones you mention. — grtjn, Mar 05 '19 at 12:37
Thanks grtjn for your reply. -xml_repair_level full worked and I can import a set of files with special characters. I need to check with more files. — Manish Joisar, Mar 11 '19 at 11:56
MLCP ran successfully for the XML files, and we created a small application to test whether all characters were ingested correctly. They were not: for example, we see Ã… and ¶ instead of Å and ö respectively. If I could somehow ingest the entity itself into the MarkLogic XML instead of the actual character, I think the browser would manage to show the right ones. — Manish Joisar, Mar 22 '19 at 16:18
The issue was related to encoding in the test application: I set the encoding to Encoding.UTF8 on the WebClient object and it worked. — Manish Joisar, Mar 25 '19 at 09:22
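
The garbling described in the last two comments is what happens when UTF-8 bytes are decoded with a single-byte encoding. The following is a minimal sketch of that effect, assuming the reader side used windows-1252 (`cp1252` in Python); the comments do not say which encoding the WebClient actually defaulted to, and the function names `misread` and `repair` are made up for illustration.

```python
def misread(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode the bytes with the wrong single-byte encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled, wrong_encoding="cp1252"):
    """Undo misread(): recover the original bytes and decode them as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


if __name__ == "__main__":
    for original in ("\u00c5", "\u00f6"):  # the two characters from the comment: A-ring and o-umlaut
        garbled = misread(original)
        # ascii() keeps the output printable on consoles that are not UTF-8.
        print(ascii(original), "->", ascii(garbled), "->", ascii(repair(garbled)))
```

Under that assumption the stored data is fine and only the reading side is wrong, which is consistent with the fix reported above (setting the client to read UTF-8).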

score 0 · Accepted Answer · answered Mar 11 '19 at 11:56

Thanks grtjn for your reply.

-xml_repair_level full worked: all records are now committed and there are no failed records.

Special characters (entity references, written with a trailing ;) are stored in ML as the real characters, as follows:

&lambda - λ
&Aring - Å
&mu - μ

I am hoping that this will be acceptable content from a business point of view.
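
To spot-check that the stored characters match the entities in the source files, one option is to compare against a known entity table. The sketch below uses the HTML5 named-entity table shipped with Python's standard library; that this table agrees with what MarkLogic's repair step produces is an assumption, and `expand_entities` is a hypothetical helper name.

```python
from html import unescape


def expand_entities(text):
    """Replace named and numeric character references with the characters they denote."""
    return unescape(text)


if __name__ == "__main__":
    for entity in ("&lambda;", "&Aring;", "&mu;"):
        # ascii() keeps the output printable on consoles that are not UTF-8.
        print(entity, "->", ascii(expand_entities(entity)))
```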

Now the only major challenge is to test for garbled characters across millions of XML records.
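
One possible way to approach that check is a heuristic scan of text exported from the database. The sketch below flags a string as suspicious when re-encoding it as windows-1252 and decoding the result as UTF-8 succeeds and changes the text, which is the signature of UTF-8 bytes that were once read as windows-1252. The choice of windows-1252 and the name `looks_garbled` are assumptions, and the check works on whole strings, so a record that mixes garbled text with characters outside windows-1252 would be missed.

```python
def looks_garbled(text, wrong_encoding="cp1252"):
    """Heuristic: True if text looks like UTF-8 that was decoded as wrong_encoding."""
    try:
        repaired = text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside wrong_encoding, or the
        # resulting bytes are not valid UTF-8: not the pattern we look for.
        return False
    # Pure ASCII survives the round trip unchanged and is not flagged.
    return repaired != text


if __name__ == "__main__":
    samples = [
        "\u00c3\u2026ngstr\u00c3\u00b6m",  # garbled form of the word below
        "\u00c5ngstr\u00f6m",              # correctly decoded
        "plain ASCII",
    ]
    for sample in samples:
        print(ascii(sample), looks_garbled(sample))
```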

Thanks grtjn for your help.
