
My question is the same as this unanswered question:

How to read Unicode XML values with rapidxml

However, the content of my XML is encoded in UTF-8. I am new to MS Visual Studio and C++.

My question is: how do I read a UTF-8 string into a wchar_t string?

Say I define a structure like this:

typedef struct{
    vector<int> stroke_labels;
    int stroke_count;
    wchar_t* uni_val;
}WORD_DETAIL;

and when I read the value from the XML I use:

WORD_DETAIL this_detail;
this_detail.uni_val=curr_word->first_node("labelDesc")->first_node("annotationDetails")->first_node("codeSequence")->value();

But the UTF-8 strings that get stored are not what I expect; the characters are corrupted.

My questions are:

  1. How can I use rapidxml to read Unicode/UTF-8 values?
  2. Are there any simpler XML parsers that do the same thing?
  3. Any example code would be greatly appreciated.

In section 2.1 here, it is mentioned:

"Note that RapidXml performs no decoding - strings returned by name() and value() functions will contain text encoded using the same encoding as source file."

If my XML is encoded in UTF-8, what is the best way to handle the return value of the ->value() function?

Thanks in advance.

Koustav Ghosal
  • Have you read section 1.2 of the documentation, http://rapidxml.sourceforge.net/manual.html#namespacerapidxml_1character_types_and_encodings? Seems that if you want to do UTF-8 to UTF-16 conversion you will have to do it yourself. But that's not very hard. – john Oct 01 '13 at 15:24
  • john : Please check my edit – Koustav Ghosal Oct 01 '13 at 17:21
  • Since you are using Windows, I guess the simplest way to convert UTF-8 to UTF-16 would be to use the Windows function MultiByteToWideChar. You can find plenty of examples of this on the internet. – john Oct 02 '13 at 05:33

1 Answer


Remember that RapidXML is an 'in-situ' parser: It parses the XML and modifies the content by adding null terminators in the correct places (and other things).

So the value() function is really just returning a char * pointer into your original data. If that's UTF-8, then RapidXML returns a pointer to a UTF-8 character string. In other words, you're already doing what you asked for in the question title.

But in the code snippet you posted, you want to store a wchar_t* pointer in a struct. First off, I recommend you don't do that at all, because of the memory ownership issues: the pointer is only valid while the buffer it points into stays alive. Remember, you're meant to be using C++, not C. And if you really want to store a raw pointer, why not the UTF-8 one you already have? See http://www.utf8everywhere.org/
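One way to sidestep the ownership problem is to let the struct own a copy of the UTF-8 bytes. The following is only a sketch: the names WordDetail, utf8_val and make_word_detail are hypothetical and do not come from the question or from RapidXML. It assumes the pointer passed in is the null-terminated char* that value() returns.

```cpp
#include <string>
#include <vector>

// Hypothetical replacement for WORD_DETAIL: std::string owns a copy of the
// UTF-8 bytes, so the struct no longer depends on the parser's buffer.
struct WordDetail {
    std::vector<int> stroke_labels;
    int stroke_count = 0;
    std::string utf8_val;  // UTF-8 encoded text, stored byte for byte
};

// Hypothetical helper: build a WordDetail from a null-terminated UTF-8
// string such as the one returned by value(). A null pointer is treated
// as an empty string.
WordDetail make_word_detail(const char* utf8_value) {
    WordDetail detail;
    detail.utf8_val = utf8_value ? utf8_value : "";
    return detail;
}
```

With this, something like make_word_detail(node->value()) would keep the text valid even after the XML buffer is freed. Note that utf8_val.size() counts bytes, not characters.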

But because it's Windows, there's a (remote) chance you'll need to pass a wide char array to an API function. If so, you will need to convert the UTF-8 to wide chars using the OS function MultiByteToWideChar:

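// Get the UTF-8 text; value() points into the buffer RapidXML parsed
char *str = xml->first_node("codeSequence")->value();

// Work out the required size in wchar_t units (this includes the null
// terminator because the length argument is -1); 0 means failure
int size = MultiByteToWideChar(CP_UTF8, 0, str, -1, NULL, 0);

// Allocate a vector of that size
std::vector<wchar_t> wide(size);

// Do the conversion (skipped if the size query failed)
if (size > 0)
    MultiByteToWideChar(CP_UTF8, 0, str, -1, &wide[0], size);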
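For illustration only, here is roughly what that conversion does, written as a portable hand-rolled decoder rather than a call to MultiByteToWideChar. The function name utf8_to_wide is hypothetical. The sketch assumes a null-terminated input, emits UTF-16 surrogate pairs where wchar_t is 16 bits (as on Windows) and one code point per wchar_t elsewhere, and does only minimal validation (it does not reject overlong encodings, for example). On Windows the OS function is the better choice.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of RapidXML or the Windows API):
// decodes a null-terminated UTF-8 string into a std::wstring.
std::wstring utf8_to_wide(const char* utf8) {
    std::wstring out;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(utf8);
    while (*p) {
        std::uint32_t cp;  // decoded code point
        int extra;         // number of continuation bytes expected
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        ++p;
        for (int i = 0; i < extra; ++i, ++p) {
            // A null terminator here also fails this check, so we never
            // read past the end of a truncated sequence.
            if ((*p & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (*p & 0x3F);
        }
        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t: split into a UTF-16 surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

Returning a std::wstring instead of filling a raw buffer keeps the ownership question simple; c_str() then gives the const wchar_t* that a wide API function expects.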
Roddy