
I have an XML document, document A, that our C++ application retrieves from a database and loads into MSXML 4 as a DOM document. The document is encoded in ISO-8859-1 and contains non-ASCII characters such as é (0xE9 in ISO-8859-1). Some of document A's nodes are copied into a newly created MSXML document, document B, which we want in UTF-8 because that is what the recipient expects. Creating document B, setting its processing instruction to declare UTF-8 as the encoding, and then copying the nodes from document A does not produce é in UTF-8 form (0xC3 0xA9). Is there another way to have MSXML do the conversion without using stylesheets? Some of the documents are megabytes in size, so any extra step may add processing time. Is there a way to do it by manipulating the XML as a flat string? We work with wchar_t-based strings (we don't use MFC). I have been looking into some Windows API functions, but they seem to take regular char, and I am not sure yet whether we would lose anything; that is what I will be testing.

Thanks, Niraj

Niraj
  • MSXML is a COM component. It uses the same string format as the rest of Windows: wide strings encoded in UTF-16. So whatever the processing instruction in the XML file says, you'll always get U+00E9 if the é is properly encoded in the file. If you want to convert it to UTF-8, that's up to you; use WideCharToMultiByte() with CP_UTF8. – Hans Passant Jul 22 '14 at 00:52
  • Thanks Hans, using WideCharToMultiByte() was part of the solution. The other part was putting the result back into a wide-character string as is, without further conversion; that worked, and the XML document was in valid UTF-8 format. – Niraj Sep 08 '14 at 19:54
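
For readers landing here, the conversion step mentioned in the comments might look roughly like the sketch below. It is only a minimal illustration, not the code actually used in the thread: the helper name `to_utf8` is hypothetical, and the non-Windows branch is a hand-written fallback added so the snippet also compiles where `WideCharToMultiByte()` is unavailable (it assumes a 32-bit `wchar_t` holding one code point per element).

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper (not from the thread): wide string in, UTF-8 bytes out.
std::string to_utf8(const std::wstring& wide) {
    if (wide.empty()) return std::string();
#ifdef _WIN32
    // First call only measures: it returns the number of bytes required.
    int len = WideCharToMultiByte(CP_UTF8, 0, wide.data(),
                                  static_cast<int>(wide.size()),
                                  nullptr, 0, nullptr, nullptr);
    if (len <= 0) return std::string();
    std::string out(static_cast<std::size_t>(len), '\0');
    // Second call writes the UTF-8 bytes into the buffer.
    WideCharToMultiByte(CP_UTF8, 0, wide.data(),
                        static_cast<int>(wide.size()),
                        &out[0], len, nullptr, nullptr);
    return out;
#else
    // Assumed fallback for platforms with a 32-bit wchar_t:
    // encode each code point by hand.
    std::string out;
    for (wchar_t wc : wide) {
        unsigned long cp = static_cast<unsigned long>(wc);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
#endif
}
```

With this, `to_utf8(L"\u00E9")` should yield the two bytes 0xC3 0xA9 that the question is looking for. Calling the API twice (once to size the buffer, once to fill it) avoids guessing how many bytes the multi-byte output will need.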
