
Currently, I'm trying to use the MSXML loadXML method in ASP to load an XML string which may contain supplementary-plane Unicode Chinese characters like

U+20BA2 (4 bytes)

and the xml string looks like

<City>City</City><Name></Name>

So, in my code, I can see that the XML string comes in right, but loadXML returns an error message like

Invalid unicode characters, &#55362;&#57250;

Can someone please tell me what I can do to resolve this issue?

Thanks,

Edited

The code looks like this

    Set objDoc = CreateObject("MSXML2.DOMDocument")
    objDoc.async = False
    objDoc.setProperty "SelectionLanguage", "XPath"
    objDoc.validateOnParse = False
    objDoc.loadXML(strXml)

2 Answers


I suggest posting the exact code, XML source and error message you are getting. I cannot reproduce an error by parsing <element></element> in MSXML 4.0 SP3; this works fine.

I certainly do get a parseError with reason "Invalid unicode character" by trying to parse <element>&#55362;&#57250;</element>, because that's not well-formed XML. If you do have this in your markup then you need to fix the serialiser that produced it because neither MSXML nor any standards-compliant XML parser will load it.

If the character is turned into a character reference, it must be &#134050; (or &#x20BA2;). Code units 55362 and 57250 are 'surrogates', reserved for encoding astral plane characters in UTF-16. They can't be included in an XML document.
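To see why the two surrogate references collapse into a single reference, the arithmetic can be sketched in JavaScript (the function name is illustrative, not from the thread):

```javascript
// Combine a UTF-16 surrogate pair into the single Unicode code point
// it encodes: subtract the surrogate bases, recombine the 10-bit
// halves, and add back the 0x10000 offset of the supplementary planes.
function codePointFromSurrogates(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

codePointFromSurrogates(55362, 57250); // 134050, i.e. U+20BA2
```

So the only well-formed reference for this character is `&#134050;` (decimal) or `&#x20BA2;` (hex), never a pair of surrogate references.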

bobince
  • @user1317838: OK, nothing wrong with the code fragment, what exactly is in the `strXml` and how was it generated and loaded? – bobince Apr 16 '12 at 20:55
  • strXml is dynamically built based on form values a user submitted. I do escape a character by using charCodeAt(index). So, is that a culprit? – user1317838 Apr 17 '12 at 04:11
  • I see this - https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/String/charCodeAt – user1317838 Apr 17 '12 at 04:19
  • Yes, creating `...;` character references using `charCodeAt` directly is the wrong thing. JavaScript Strings are based on UTF-16 code units *not* characters, so the name `charCodeAt` is misleading (it's actually `codeUnitAt`). To get a full Unicode character from a pair of surrogates in JavaScript you have to do the maths manually, or prototype in support for a `fullCharCodeAt` method to `String`. Much better just not to character-reference-encode your non-ASCII characters at all: just add the raw string to the XML and let UTF-8 take care of them. – bobince Apr 17 '12 at 08:28
  • for your last comment, so in order for XML and UTF-8 to take care of non-ASCII characters, what do I need to do? Just declare xml encoding to UTF-8? Since, xml's default encoding is utf-8, I could just declare xml I guess. – user1317838 Apr 17 '12 at 12:04
  • Yeah, since UTF-8 is the default you don't need an XML Declaration at all. You do need to make sure you have a UTF-8-clean communications channel back to your ASP, by serving your HTML page as UTF-8 and if you decode requests then use UTF-8 as the encoding. (Don't know if this is VB.NET or classic ASP or...) But you probably should be doing that anyway. – bobince Apr 18 '12 at 08:26
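The "do the maths manually" approach bobince mentions in the comments might look like the sketch below. `fullCharCodeAt` is the hypothetical helper name from his comment; modern JavaScript engines expose the same logic built in as `String.prototype.codePointAt`.

```javascript
// Read a full Unicode code point from a string at the given index,
// treating a surrogate pair as one character. Falls back to the raw
// code unit for BMP characters or unpaired surrogates.
function fullCharCodeAt(str, index) {
  var code = str.charCodeAt(index);
  if (code >= 0xD800 && code <= 0xDBFF && index + 1 < str.length) {
    var low = str.charCodeAt(index + 1);
    if (low >= 0xDC00 && low <= 0xDFFF) {
      return (code - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
    }
  }
  return code;
}

var s = "\uD842\uDFA2"; // U+20BA2 as a surrogate pair
fullCharCodeAt(s, 0);   // 134050, so a correct reference is &#134050;
fullCharCodeAt("A", 0); // 65, plain BMP characters are unaffected
```

As the comments say, though, the simpler fix is to skip character-reference encoding of non-ASCII entirely and let the UTF-8 channel carry the raw characters.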

&#55362;&#57250; is the character-reference form of 0xD842 0xDFA2, which is the UTF-16 encoded form of the Unicode character U+20BA2. Make sure that the XML is completely UTF-16 encoded, not mixed single-byte ASCII and multi-byte UTF-16.
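The split in the other direction can be sketched like this (`toSurrogates` is an illustrative name, not part of any API in the thread):

```javascript
// Split a supplementary-plane code point into its UTF-16 surrogate
// pair: subtract the 0x10000 plane offset, then map the high and low
// 10-bit halves onto the 0xD800 and 0xDC00 surrogate ranges.
function toSurrogates(codePoint) {
  var offset = codePoint - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

toSurrogates(0x20BA2); // [0xD842, 0xDFA2], i.e. [55362, 57250]
```

This is exactly the pair the error message shows, which is why the broken serialiser produced those two numbers.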

Remy Lebeau
  • What programing language are you using, what data type is `strXml` declared as, and how is it getting filled in with the XML content? – Remy Lebeau Apr 17 '12 at 02:41