1

I'm parsing an XML file that can contain localized strings in different languages (at the moment it's just English and Spanish, but in the future it could be any language). The XML parser's API returns all data from the XML as a char* which is UTF-8 encoded.

Some manipulation of the data is required after it's been parsed (searching within it for substrings, concatenating strings, determining the length of substrings, etc.).

It would be convenient to use standard functions such as strlen, strcat etc. As the raw data I'm receiving from the XML parser is a char* I can do all manipulation readily using these standard string handling functions.

However, these all of course assume and require that the strings are NUL-terminated. My question therefore is: if you have wide data represented as a char*, can a NUL terminator character occur within the data rather than at the end?

i.e. if a character in a certain language doesn't require 2 bytes to represent it, and it is represented in one byte, will/can the other byte be NULL?

Gruntcakes

2 Answers

3

UTF-8 is not "wide". UTF-8 is a multibyte encoding in which a Unicode character takes 1 to 4 bytes. A zero byte never appears inside the encoding of any character other than NUL itself, so valid UTF-8 has no embedded terminators. Make sure you are not confused about what your parser is giving you. It could be UTF-16 or UCS-2, or their 4-byte equivalents, placed in wide character strings, in which case you have to treat them as wide strings.
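As a rough illustration of the point above, here is a minimal sketch of concatenating two UTF-8 strings with the ordinary byte-oriented functions. The helper name `utf8_concat` is made up for this example, and it assumes both inputs are valid, NUL-terminated UTF-8.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: join two NUL-terminated UTF-8 strings into a
   newly allocated buffer. Returns NULL if allocation fails; the caller
   frees the result. */
char *utf8_concat(const char *a, const char *b)
{
    size_t la = strlen(a);        /* lengths in bytes, not characters */
    size_t lb = strlen(b);
    char *out = malloc(la + lb + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, a, la);
    memcpy(out + la, b, lb + 1);  /* +1 also copies the terminating NUL */
    return out;
}
```

For example, `utf8_concat("espa\xC3\xB1ol", " y English")` yields an 18-byte string: the ñ occupies two bytes (0xC3 0xB1), neither of which is zero, so `strlen` runs through it without stopping early.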

cababunga
  • So if the parser returns UTF-8, I can do manipulations on the UTF-8 data as char*, then call a UTF8toUTF16 conversion function before displaying the strings? (The GUI elements take uint16* (unsigned short) arguments.) – Gruntcakes Jun 02 '11 at 18:31
  • Yes, but you won't know how many characters there are in your string while it's encoded in UTF-8. – cababunga Jun 02 '11 at 18:42
  • UTF-8 is 1 to 4 bytes; the 5- and 6-byte encodings have been dropped, as the range of Unicode codepoints does not require them. – Patrick Schlüter Jun 02 '11 at 18:55
  • The conversion function is [MultiByteToWideChar](http://msdn.microsoft.com/en-us/library/dd319072%28v=vs.85%29.aspx) in the Windows world... I don't know what GUI toolkit you are using, but I would rather use something that takes a wstring pointer... – Paweł Dyda Jun 02 '11 at 18:57
  • I'm really not sure where you are getting that: http://en.wikipedia.org/wiki/Utf-8#Design Four-byte encoding covers only the Basic Multilingual Plane; all Supplementary planes take the rest of the space: http://en.wikipedia.org/wiki/Unicode_plane – cababunga Jun 02 '11 at 19:06
  • @cababunga: the BMP (up to 0xFFFF) is covered with 3 bytes; all UTF-16 representable codepoints (up to 0x10FFFF) are covered with 4 bytes (which reach up to 0x1FFFFF). Unicode has declared that they won't use codepoints larger than 0x10FFFF, and that Unicode UTF-8 is up to 4 bytes. I believe ISO-10646 UTF-8 is still up to 6 bytes, covering up to 0x7FFFFFFF (i.e. 31 bits). – ninjalj Jun 02 '11 at 19:26
  • Thanks for the explanation. I somehow missed that. – cababunga Jun 02 '11 at 19:47
  • I'm pretty sure ISO-10646 was amended too to fix this (remove the useless 5- and 6-byte sequences). In any case the IETF RFC forbids them too. – R.. GitHub STOP HELPING ICE Jun 02 '11 at 20:13
  • I don't need to know how many characters, but will need to know how many bytes. So that should be OK. Some of the strings may contain content like this: "text {.text} text", and I need to search for '{', '}', '.' and perform splitting etc. Is it valid to search for those individual characters (i.e. using array-like notation, if (string[n] == '{')) if the data is in UTF-8? – Gruntcakes Jun 02 '11 at 20:46
  • That should be OK. Any ASCII character has exactly the same representation in UTF-8. – cababunga Jun 02 '11 at 20:56
  • @cababunga: more importantly, no ASCII character appears in the representation of non-ASCII characters. – ninjalj Jun 02 '11 at 21:03
  • @Teres: UTF-8 is designed to be compatible with legacy programs written for ASCII, so apart from counting characters, mostly everything can be done as in ASCII, _including_ zero-terminated strings (ASCII NUL only appears in UTF-8 to represent the character NUL), and hard-coded path separators (a previous version of UTF-8 was called UTF-FSS for FileSystem-Safe). – ninjalj Jun 02 '11 at 21:08
0

C distinguishes between multibyte characters and wide characters:

  • Wide characters must be able to represent any character of the execution character set using exactly the same number of bytes (e.g. if 兀 takes 4 bytes to be represented, A must also take 4 bytes to be represented). Examples of wide character encodings are UCS-4, and the deprecated UCS-2.

  • Multibyte characters can take a varying number of bytes to be represented. Examples of multibyte encodings are UTF-8 and UTF-16.

When using UTF-8, you can continue to use the str* functions, but bear in mind that they don't provide a way to get the length of a string in characters; for that you need to convert to wide characters and use wcslen. strlen returns the length in bytes, not characters, which is useful in different situations.
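If only a count is needed, one alternative to the wcslen route (a different technique, not the conversion described above) is to count the bytes that are not UTF-8 continuation bytes. The sketch below uses a hypothetical helper name, `utf8_codepoint_count`, and assumes the input is valid UTF-8. It counts code points, which is not always the same as user-perceived characters.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated string,
   assuming it is valid UTF-8. Continuation bytes have the bit pattern
   10xxxxxx, so every byte that does not match starts a new code point. */
size_t utf8_codepoint_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For "espa\xC3\xB1ol" (español), this returns 7 while strlen returns 8, because the ñ is stored as two bytes.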

I can't stress enough that every element of the execution character set needs to be representable in a single wide character of a predefined size in bytes. Some systems use UTF-16 for their wide characters; the result is that the implementation can't conform to the C standard, and some wc* functions can't possibly work right.

ninjalj
  • The content output from the XML parser will be like "text1 {.text2} {text3}". I need to form strings like "text1 somethingelse text3" from that. So I need to parse for '{', '}' and '.' and build up a new string as I go along. If I treat the content other than '{', '}', '.' as a stream of bytes as opposed to characters, I'm assuming I can use the strcpy, strcat etc. functions to build the result, then convert the result into UTF-16. – Gruntcakes Jun 02 '11 at 20:58
  • Yes, as I said the `str*` functions mostly work in UTF-8 with the same semantics, save for `strlen()` due to the fact that while `char` = `byte`, `multibyte char` ≠ `byte/char`. – ninjalj Jun 02 '11 at 21:02
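
To make the delimiter search discussed in these comments concrete, here is a small sketch that pulls out the bytes between the first '{' and the following '}' in a UTF-8 string. The helper name `copy_first_braced` and its return convention are made up for illustration, and this is not the full replacement logic the question describes. It relies only on the fact that the bytes for '{' and '}' never occur inside a multibyte UTF-8 sequence.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the bytes between the first '{' and the
   next '}' into out, NUL-terminated. Returns the number of bytes
   copied, or -1 if no brace pair is found or out is too small. */
long copy_first_braced(const char *s, char *out, size_t outsz)
{
    const char *open = strchr(s, '{');
    if (open == NULL)
        return -1;
    const char *close = strchr(open + 1, '}');
    if (close == NULL)
        return -1;
    size_t len = (size_t)(close - (open + 1));  /* length in bytes */
    if (len + 1 > outsz)
        return -1;
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return (long)len;
}
```

Called on "hola {.ni\xC3\xB1o} fin" with a large enough buffer, it copies the 6 bytes of ".niño" and returns 6; the two-byte ñ passes through untouched because the search only ever compares against ASCII bytes.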