Logic behind converting a character to UTF-8

Question

I have the following piece of code. According to a comment in the code, it converts any character greater than 0x7F to UTF-8. I have the following questions about it:

if((const unsigned char)c > 0x7F)
  {
    Buffer[0] = 0xC0 | ((unsigned char)c >> 6);
    Buffer[1] = 0x80 | ((unsigned char)c & 0x3F);
    return Buffer;
  }

How does this code work?
Does the Windows code page I am currently using have any effect on the character placed in Buffer?

It... works as defined by the UTF-8 encoding? How else would it? — R. Martinho Fernandes, Aug 01 '13 at 15:34
@R.MartinhoFernandes: I guess so... this code was not written by me. It's been working for a while now, so I guess it's correct. I wanted to understand the logic behind it. — Asha, Aug 01 '13 at 15:35
@Asha there's not much to understand, assuming you know what the `|` and `>>` operators do (if not, it should be easy to find in some C++ learning material). The UTF-8 specification says where each of the bits needs to go, and that code simply puts all the bits where they need to be. — R. Martinho Fernandes, Aug 01 '13 at 15:42
@R.MartinhoFernandes Except when it doesn't. The code supposes that the single-byte encoding is Latin-1, which has been largely superseded by Latin-15. (I'm also curious about `Buffer`, and the fact that he returns a pointer to it, and the fact that it isn't `'\0'` terminated. I'd be very suspicious of this code.) — James Kanze, Aug 01 '13 at 15:47
@JamesKanze, Latin-15 is very theoretical and little used in practice, and this really has nothing to do with the question. — Jukka K. Korpela, Aug 01 '13 at 17:28
@JukkaK.Korpela It's true that by the time Latin-15 came along, UTF-8 was already making great strides. Nevertheless, it is the most widely used single-byte encoding in the areas I've worked in. As to the question: it was "how does this work?", to which the only correct answer is "it doesn't", at least in general, and no competent programmer would write such junk. — James Kanze, Aug 02 '13 at 08:06
@James if you mean ISO-8859-15, I think that's the confusingly called Latin-9 (because the standard is named "Information technology — 8-bit single-byte coded graphic character sets — Part 15: *Latin alphabet No. 9*"). — R. Martinho Fernandes, Aug 02 '13 at 10:39
@R.MartinhoFernandes That's what I mean, yes. I almost always use the ISO-8859 names, but shortened it here because I feared the character limits. (Not that "Latin" is that much shorter than "ISO-8859".) — James Kanze, Aug 02 '13 at 11:51

score 10 · Accepted Answer · answered Aug 01 '13 at 15:45

For starters, the code doesn't work in general. By coincidence, it works if the encoding in char (or unsigned char) is ISO-8859-1, because the ISO-8859-1 code points are identical to the first 256 Unicode code points. But ISO-8859-1 has largely been superseded by ISO-8859-15, so it probably won't work. (Try it for 0xA4, for example, which is the Euro sign in ISO-8859-15. The code will give you 0xC2, 0xA4, which is the UTF-8 for U+00A4, the generic currency sign: a completely different character.)
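To make the coincidence concrete, here is a minimal sketch of the same bit manipulation as a standalone function. The name `highByteToUtf8` and the `std::array` return type are my own choices for illustration, not part of the original code, and the function assumes the input byte really is ISO-8859-1.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper that mirrors the question's snippet. It treats the
// byte as the code point U+0080..U+00FF, which is only right for ISO-8859-1.
std::array<std::uint8_t, 2> highByteToUtf8(std::uint8_t c)
{
    assert(c > 0x7F);  // bytes up to 0x7F are plain ASCII: one UTF-8 byte
    return {{
        static_cast<std::uint8_t>(0xC0 | (c >> 6)),   // lead byte: 110 + top 2 bits
        static_cast<std::uint8_t>(0x80 | (c & 0x3F))  // trail byte: 10 + low 6 bits
    }};
}
```

For 0xE9 (é in ISO-8859-1) this gives 0xC3, 0xA9, which is the right UTF-8. For 0xA4 it gives 0xC2, 0xA4, which is correct only if the byte meant the currency sign and wrong if it meant the ISO-8859-15 Euro sign.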

There are two correct ways to do this conversion, both of which depend on knowing the encoding of the input byte (which means that you may need several versions of the code, one per encoding). The simplest is to have an array of 256 strings, one per byte value, and index into that; in that case, you don't need the if. The other is to translate the byte into a Unicode code point (a 32-bit UTF-32 value) and then translate that into UTF-8, which can require more than two bytes for some characters: the Euro sign is U+20AC, which encodes as 0xE2, 0x82, 0xAC.
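As a rough sketch of the second approach, assuming the input is ISO-8859-15: first map the byte to a code point, then encode the code point. Both function names (`iso8859_15ToCodePoint`, `encodeUtf8`) are hypothetical, and a real program would need one mapping function per source encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: map one ISO-8859-15 byte to its Unicode code point.
// ISO-8859-15 matches ISO-8859-1 except at the eight positions listed here.
char32_t iso8859_15ToCodePoint(std::uint8_t b)
{
    switch (b) {
        case 0xA4: return 0x20AC;  // euro sign
        case 0xA6: return 0x0160;  // S with caron
        case 0xA8: return 0x0161;  // s with caron
        case 0xB4: return 0x017D;  // Z with caron
        case 0xB8: return 0x017E;  // z with caron
        case 0xBC: return 0x0152;  // OE ligature
        case 0xBD: return 0x0153;  // oe ligature
        case 0xBE: return 0x0178;  // Y with diaeresis
        default:   return b;       // same code point as ISO-8859-1
    }
}

// Hypothetical helper: encode one code point as UTF-8 (1 to 4 bytes).
std::string encodeUtf8(char32_t cp)
{
    // Only valid scalar values: at most U+10FFFF and not a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                 // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With these, `encodeUtf8(iso8859_15ToCodePoint(0xA4))` yields the three bytes 0xE2, 0x82, 0xAC. The table approach would simply precompute these strings for all 256 byte values.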

EDIT:

For a good introduction to UTF-8, see http://www.cl.cam.ac.uk/~mgk25/unicode.html. The title says it is for Unix/Linux, but there is very little, if any, system-specific information in it (and such information is clearly marked).
