11

I'm really confused about the UTF encodings in Unicode.

There are UTF-8, UTF-16, and UTF-32.

My questions are:

  1. Which UTF encodings support all Unicode blocks?

  2. Which UTF is best (performance, size, etc.), and why?

  3. What are the differences between these three UTFs?

  4. What are endianness and byte order marks (BOMs)?

Thanks

Ahmad
  • 4,224
  • 8
  • 29
  • 40
  • 6
    There is also UTF-7. But why don't you read some [W](http://en.wikipedia.org/wiki/Unicode)[i](http://en.wikipedia.org/wiki/Unicode_plane)[k](http://en.wikipedia.org/wiki/UTF-7)[i](http://en.wikipedia.org/wiki/UTF-8)[p](http://en.wikipedia.org/wiki/UTF-16)[e](http://en.wikipedia.org/wiki/UTF-32)[d](http://en.wikipedia.org/wiki/Byte-order_mark)[i](http://en.wikipedia.org/wiki/Little-endian)[a](http://en.wikipedia.org/wiki/Big-endian) articles on the topic to understand this? This is not a programming question per se. – Lucero Jul 30 '11 at 09:42
  • 2
    And [UTF-9 and UTF-18](http://tools.ietf.org/html/rfc4042) ;-) – Joey Jul 30 '11 at 09:59
  • @Lucero : That's what I'm asking here: even after reading the wiki, I'm still confused :D. I think my question is related to programming, since it's important for us to understand this before writing code. – Ahmad Jul 30 '11 at 10:06
  • @Joey : yeah, you're right :D, let's just focus on 8, 16, and 32 lol. More UTFs would make me doubly confused :D – Ahmad Jul 30 '11 at 10:08
  • @Lucero: +1 for ninja linking. – Kos Jul 30 '11 at 10:18
  • 1
    UTF-anything suffices for all Unicode code points, provided you do it right (many do not). Your question is overly broad and imprecise. What programming language are you using, and what is it that you are trying to do? – tchrist Jul 30 '11 at 14:21

6 Answers

28

Which UTF encodings support all Unicode blocks?

All UTF encodings support all Unicode blocks: every UTF encoding can represent every Unicode codepoint. However, some older, non-UTF encodings, such as UCS-2 (which is like UTF-16 but lacks surrogate pairs, and thus cannot encode codepoints above 65535/U+FFFF), may not.

Which UTF is best (performance, size, etc.), and why?

For textual data that is mostly English and/or just ASCII, UTF-8 is by far the most space-efficient. However, UTF-8 is sometimes less space-efficient than UTF-16 and UTF-32 where most of the codepoints used are high (such as large bodies of CJK text).
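For a rough sense of the size trade-off, here is a minimal Python 3 sketch comparing encoded lengths; the sample strings are arbitrary, and the -le codec variants are used so that no BOM is included in the count:

```python
# Compare encoded sizes of mostly-ASCII text vs. CJK text.
samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "chinese": "敏捷的棕色狐狸跳过懒狗。",
}

for name, text in samples.items():
    for codec in ("utf-8", "utf-16-le", "utf-32-le"):
        size = len(text.encode(codec))
        print(f"{name:8} {codec:10} {size:4d} bytes for {len(text)} codepoints")
```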

What are the differences between these three UTFs?

UTF-8 encodes each Unicode codepoint in one to four bytes. Codepoints 0 to 127, which are identical to ASCII, are encoded as single bytes, exactly as in ASCII. Byte values 128 to 255 appear only as part of multi-byte sequences.
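To make the one-to-four-byte pattern concrete, a small Python 3 sketch (the sample characters are chosen arbitrarily):

```python
# UTF-8 length grows with the codepoint value:
# 1 byte for ASCII, 2 up to U+07FF, 3 up to U+FFFF, 4 for U+10000 and above.
for ch in ("A", "é", "€", "𐍈"):   # U+0041, U+00E9, U+20AC, U+10348
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {hex_bytes} ({len(encoded)} bytes)")
```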

UTF-16 encodes each Unicode codepoint in either two bytes (one UTF-16 value) or four bytes (two UTF-16 values). Anything in the Basic Multilingual Plane (Unicode codepoints 0 to 65535, or U+0000 to U+FFFF) is encoded with one UTF-16 value. Codepoints from the higher planes use two UTF-16 values, through a technique called 'surrogate pairs'.
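For illustration, here is the surrogate-pair arithmetic spelled out in Python 3, using U+1F600 as an arbitrary non-BMP example and cross-checking the result against the built-in utf-16-be codec:

```python
# Derive the UTF-16 surrogate pair for a non-BMP codepoint by hand.
cp = 0x1F600                       # arbitrary non-BMP codepoint (an emoji)
offset = cp - 0x10000              # 20-bit value split across two surrogates
high = 0xD800 + (offset >> 10)     # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)    # low (trail) surrogate
print(f"U+{cp:X} -> {high:#06x} {low:#06x}")

# Cross-check against Python's own encoder.
assert chr(cp).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```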

UTF-32 is not a variable-length encoding for Unicode; all Unicode codepoint values are encoded as-is. This means that U+10FFFF is encoded as 0x0010FFFF.

What are endianness and byte order marks (BOMs)?

Endianness is the order in which a CPU architecture, protocol, or file format stores the bytes of a multi-byte value. Little-endian systems (such as x86-32 and x86-64 CPUs) put the least-significant byte first, and big-endian systems (such as PowerPC and many networking protocols) put the most-significant byte first.

In a little-endian encoding or system, the 32-bit value 0x12345678 is stored or transmitted as 0x78 0x56 0x34 0x12. In a big-endian encoding or system, it is stored or transmitted as 0x12 0x34 0x56 0x78.
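As a quick check, the same 32-bit value can be serialized in both orders with Python's struct module:

```python
import struct

value = 0x12345678
print(struct.pack("<I", value).hex())   # little-endian: 78563412
print(struct.pack(">I", value).hex())   # big-endian:    12345678
```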

A byte order mark is used in UTF-16 and UTF-32 to signal which endianness the text is to be interpreted as. Unicode does this in a clever way: U+FEFF is a valid codepoint, used for the byte order mark, while the byte-swapped value U+FFFE is a noncharacter that will never appear in valid text. Therefore, if a file starts with 0xFF 0xFE, it can be assumed that the rest of the file is stored in little-endian byte order.

A byte order mark in UTF-8 is technically possible, but it says nothing about byte order, since UTF-8 has no endianness variants. However, a stream that begins with the UTF-8-encoded BOM is almost certainly UTF-8, so the BOM can be used as an identification marker.
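A minimal BOM-sniffing sketch in Python 3, using the constants from the standard codecs module (this is only a heuristic; BOM-less UTF-8 and UTF-16 are common in practice):

```python
import codecs

def sniff_bom(data):
    """Return a codec name guessed from a leading BOM, or None if there is none."""
    # Check UTF-32 before UTF-16: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
    if data.startswith(codecs.BOM_UTF32_LE):
        return "utf-32-le"
    if data.startswith(codecs.BOM_UTF32_BE):
        return "utf-32-be"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None

print(sniff_bom("hi".encode("utf-16")))   # the generic utf-16 codec writes a BOM first
```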

Benefits of UTF-8

  • ASCII is a subset of the UTF-8 encoding and therefore is a great way to introduce ASCII text into a 'Unicode world' without having to do data conversion
  • UTF-8 text is the most compact format for ASCII text
  • Sorting valid UTF-8 strings by raw byte value yields the same order as sorting them by codepoint
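That last property can be checked directly in Python 3 (the sample strings are arbitrary):

```python
# Byte-wise order of UTF-8 matches codepoint order.
words = ["zebra", "Ångström", "日本", "𐍈ot", "apple"]
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_codepoints = sorted(words)   # Python compares strings codepoint by codepoint
assert by_bytes == by_codepoints
print(by_bytes)
```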

Benefits of UTF-16

  • UTF-16 is easier than UTF-8 to decode, even though it is a variable-length encoding
  • UTF-16 is more space-efficient than UTF-8 for BMP characters above U+07FF (where UTF-8 needs three bytes per character but UTF-16 needs two)

Benefits of UTF-32

  • UTF-32 is not variable-length, so it requires no special logic to decode
Delan Azabani
  • 79,602
  • 28
  • 170
  • 210
  • 3
    The UTF-8 “BOM” is used as an indicator of UTF-8 text, though. While it isn't technically a BOM it becomes more of a marker. Note that UTF-32 is never more space-efficient than UTF-8 and UTF-16. It also isn't usually used in interchange, but rather as an internal encoding because of the benefit you mention later. – Joey Jul 30 '11 at 10:01
  • UTF-8 BOMs and their use as UTF-8 identification are more of an informal, heuristic thing though, rather than a formally defined recommendation. Nevertheless, I see what you mean, and you are correct. I also agree with your comment about UTF-32; the only real place I see it being used is an intermediate format in RAM, so that string operations carried out are simpler and faster. Once the text is finished processing, it should probably be turned back into UTF-8 or UTF-16. – Delan Azabani Jul 30 '11 at 10:05
  • Ah, indeed, you're right. The Unicode Standard says “Use of a BOM is neither required nor recommended for UTF-8” but continues to say that it can be encountered sometimes. – Joey Jul 30 '11 at 10:16
  • Thanks for your answer. So even UTF-8 can store non-BMP characters? – Ahmad Jul 30 '11 at 10:23
  • Yes, that is true. A non-BMP character will be stored in four values, and therefore, four bytes, of UTF-8. – Delan Azabani Jul 30 '11 at 10:28
  • Actually I'm now creating an application for language detection. What do you think? Which UTF should I use? :) Thanks – Ahmad Jul 30 '11 at 10:57
  • In what aspect of your program are you making a choice on the UTF to use? – Delan Azabani Jul 30 '11 at 10:59
  • i don't know what you mean :D – Ahmad Jul 30 '11 at 11:44
  • 1
    @Ahmad Your question as to “what UTF to use” is confusing, because usually you don’t have much choice in the matter internally, which depends on programming language, as I point out in my [Unicode Support Shootout](http://training.perl.com/OSCON2011/index.html) talk covering 7 programming languages. If I’d freedom to choose, I’d use UTF-8 exclusively. Arguments exist for UTF-32 internally, but I don't know if those stand up to careful scrutiny, since we lived with O(N) `strlen` forever, and because it costs memory. I don’t like UTF-16 due to language support and worst-of-both-worlds aspects. – tchrist Jul 30 '11 at 14:12
  • @Delan The main problem with a BOM is that it is sentinel/signalling metadata lurking at the start of the data stream. If you write a string of 10 code points out to a file, the UTF-16 and -32 versions get 11 apparent code points, which must be handled in the reader. The UTF-8 is free of that. Or should be. Windows bumbleware that mistakenly expands the UTF-8 to 11 code points is one of my pet peeves. – tchrist Jul 30 '11 at 14:19
  • Another disadvantage of UTF-16 is that unlike UTF-8 and UTF-32BE, its binary representation doesn't sort like its logical meaning. You have to decode it to sort it on code points. – Leon Timmermans Nov 21 '12 at 14:34
18

“Answer me these questions four, as all were answered long before.”

You really should have asked one question, not four. But here are the answers.

  1. All UTF transforms by definition support all Unicode code points. That is something you needn’t worry about. The only problem is that some systems are really UCS-2 yet claim they are UTF-16, and UCS-2 is severely broken in several fundamental ways:

    • UCS-2 is not a valid Unicode encoding.
    • UCS-2 supports only ¹⁄₁₇ᵗʰ of Unicode. That is, Plane 0 only, not Planes 1–16.
    • UCS-2 permits code points that The Unicode Standard guarantees will never be in a valid Unicode stream. These include
      • all 2,048 UTF-16 surrogates, code points U+D800 through U+DFFF
      • the 32 non-character code points between U+FDD0 and U+FDEF
      • both sentinels at U+FFFE and U+FFFF

    For what encoding is used internally by seven different programming languages, see slide 7 on Feature Support Summary in my OSCON talk from last week entitled “Unicode Support Shootout”. It varies a great deal.

  2. UTF-8 is the best serialization transform of a stream of logical Unicode code points because, in no particular order:

    • UTF-8 is the de facto standard Unicode encoding on the web.
    • UTF-8 can be stored in a null-terminated string.
    • UTF-8 is free of the vexing BOM issue.
    • UTF-8 risks no confusion of UCS-2 vs UTF-16.
    • UTF-8 compacts mainly-ASCII text quite efficiently, so that even Asian texts that are in XML or HTML often wind up taking fewer bytes than they would in UTF-16 (see the sketch after this list). This is an important thing to know, because it is a counterintuitive and surprising result. The ASCII markup tags often make up for the extra byte. If you are really worried about storage, you should be using proper text compression, like LZW and related algorithms. Just bzip it.
    • If need be, it can be roped into use for trans-Unicodian points of arbitrarily large magnitude. For example, MAXINT on a 64-bit machine becomes 13 bytes using the original UTF-8 algorithm. This property is of rare usefulness, though, and must be used with great caution lest it be mistaken for a legitimate UTF-8 stream.

    I use UTF-8 whenever I can get away with it.

  3. I have already given properties of UTF-8, so here are some for the other two:

    • UTF-32 enjoys a singular advantage for internal storage: O(1) access to code point N. That is, constant time access when you need random access. Remember we lived forever with O(N) access in C’s strlen function, so I am not sure how important this is. My impression is that we almost always process our strings in sequential not random order, in which case this ceases to be a concern. Yes, it takes more memory, but only marginally so in the long run.
    • UTF-16 is a terrible format, having all the disadvantages of UTF-8 and UTF-32 but none of the advantages of either. It is grudgingly true that when properly handled, UTF-16 can certainly be made to work, but doing so takes real effort, and your language may not be there to help you. Indeed, your language is probably going to work against you instead. I’ve worked with UTF-16 enough to know what a royal pain it is. I would stay clear of both these, especially UTF-16, if you possibly have any choice in the matter. The language support is almost never there, because there are massive pods of hysterical porpoises all contending for attention. Even when proper code-point instead of code-unit access mechanisms exist, these are usually awkward to use and lengthy to type, and they are not the default. This leads too easily to bugs that you may not catch until deployment; trust me on this one, because I’ve been there.

    That’s why I’ve come to talk about there being a UTF-16 Curse. The only thing worse than The UTF-16 Curse is The UCS-2 Curse.

  4. Endianness and the whole BOM thing are problems that curse both UTF-16 and UTF-32 alike. If you use UTF-8, you will not ever have to worry about these.
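For the counterintuitive size claim in point 2, a quick check in Python 3 (the markup snippet below is made up, and the -le codec is used so no BOM is counted):

```python
# ASCII-heavy markup around CJK text: UTF-8 often wins overall.
snippet = '<p class="note" lang="zh">统一码不只是更多的字符。</p>'
print("utf-8 :", len(snippet.encode("utf-8")), "bytes")
print("utf-16:", len(snippet.encode("utf-16-le")), "bytes")
```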

I sure do hope that you are using logical (that is, abstract) code points internally with all your APIs, and worrying about serialization only for external interchange alone. Anything that makes you get at code units instead of code points is far far more hassle than it’s worth, no matter whether those code units are 8 bits wide or 16 bits wide. You want a code-point interface, not a code-unit interface. Now that your API uses code points instead of code units, the actual underlying representation no longer matters. It is important that this be hidden.
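To see the difference between code points and code units concretely, a small Python 3 sketch (Python 3 strings are code-point based; the example string is arbitrary):

```python
# One string, three different counts.
s = "G\u0308\U0001D11E"   # 'G', COMBINING DIAERESIS, MUSICAL SYMBOL G CLEF
print(len(s))                           # 3 code points
print(len(s.encode("utf-8")))           # 7 UTF-8 code units (bytes): 1 + 2 + 4
print(len(s.encode("utf-16-le")) // 2)  # 4 UTF-16 code units: 1 + 1 + a surrogate pair
# Note: even the code-point count is not the user-perceived character count;
# 'G' plus the combining diaeresis renders as a single grapheme.
```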


Category Errors

Let me add that everyone talking about ASCII versus Unicode is making a category error. Unicode is very much NOT “like ASCII but with more characters.” That might describe ISO 10646, but it does not describe Unicode. Unicode is not merely a particular repertoire of characters; it is also the rules for handling them. Not just more characters, but rather more characters that have particular rules accompanying them. Unicode characters without Unicode rules are no longer Unicode characters.

If you use an ASCII mindset to handle Unicode text, you will get all kinds of brokenness, again and again. It doesn’t work. As just one example of this, it is because of this misunderstanding that the Python pattern-matching library, re, does the wrong thing completely when matching case-insensitively. It blindly assumes two code points count as the same if both have the same lowercase. That is an ASCII mindset, which is why it fails. You just cannot treat Unicode that way, because if you do you break the rules and it is no longer Unicode. It’s just a mess.

For example, Unicode defines U+03C3 GREEK SMALL LETTER SIGMA and U+03C2 GREEK SMALL LETTER FINAL SIGMA as case-insensitive versions of each other. (This is called Unicode casefolding.) But since neither changes when blindly mapped to lowercase and compared, that comparison fails. You just can’t do it that way. You can’t fix it in the general case by switching the lowercase comparison to an uppercase one, either. Using casemapping when you need to use casefolding betrays a shaky understanding of the whole works.
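In Python 3 (which the next paragraph recommends anyway), str.casefold() applies Unicode case folding and handles the sigma example, while the naive lowercase comparison does not:

```python
# Casemapping (lower) vs. casefolding for the two lowercase sigmas.
final_sigma, small_sigma = "\u03C2", "\u03C3"   # ς and σ
print(final_sigma.lower() == small_sigma.lower())        # False: lowercasing changes neither
print(final_sigma.casefold() == small_sigma.casefold())  # True: casefolding maps ς to σ
```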

(And that’s nothing: Python 2 is broken even worse. I recommend against using Python 2 for Unicode; use Python 3 if you want to do Unicode in Python. For Pythonistas, the solution I recommend for Python’s innumerably many Unicode regex issues is Matthew Barnett’s marvelous regex library for Python 2 and Python 3. It is really quite neat, and it actually gets Unicode casefolding right — amongst many other Unicode things that the standard re gets miserably wrong.)

REMEMBER: Unicode is not just more characters: Unicode is rules for handling more characters. One either learns to work with Unicode, or else one works against it, and if one works against it, then it works against you.

tchrist
  • 78,834
  • 30
  • 123
  • 180
  • +1 for the concise and bolded advice about codepoint, not code unit interfaces. Well said. – Ray Toal Jul 30 '11 at 17:11
  • Side note: I'd be very interested to hear about the problems with python 2.x. My only awareness of a concrete difference is whether string literals default to bytestrings or unicode strings. You skipped python 2.x in your talk to keep the contents PG-13, but as you'll recall, almost everyone there was still stuck on python 2 for one reason or another. – ArthurDenture Jul 31 '11 at 15:57
  • @ArthurDenture No, you don’t want to hear about those. I find the Python 2 character model completely odious. You have to say `-*- coding: UTF-8 -*-` but then for no reason it **ALSO** makes you say `u"..."` on all your strings. Its error reporting is completely broken as to what column you’re on. You also have to do manual `encode("utf8")` on each and every output string. It can’t do casefolding on anything but ASCII alone, which makes it completely useless. There is a huge chance of screwing up and you won’t find out till too late. It is a miserable experience. – tchrist Jul 31 '11 at 17:42
  • @ArthurDenture Python2 fails 26% (71/270) of its casemapping and casefolding tests, whereas Python3 fails only 13% (36/270) of them using the same dataset. It shouldn’t fail any of them, but at least it’s getting better. Python2’s failure breakdown is 3/7/9 on lower/title/upper casemappings vs Python3’s 2/7/8. Casefolding failures are 11/18/23 under Python2 but only 2/5/12 under Python3. Either way you slice it, it’s not reliable enough for solid Unicode work. – tchrist Jul 31 '11 at 19:00
  • @ArthurDenture If you’re terribly eager, I’ve been hacking on the python casing tests, both [for Python 2](http://training.perl.com/OSCON2011/case-test.python2) and [for Python 3](http://training.perl.com/OSCON2011/case-test.python3). There’s a bit of noise, but they seem to have stabilized at about 30% and 15% failures respectively. Make of it what you will, but what I find most interesting is all the differences between v2 and v3. – tchrist Jul 31 '11 at 20:28
  • @ArthurDenture: I updated those scripts so they're less dumb. The bottom line is that Python2 fails 50% of the case-insensitive matches if you use the standard re library, but only 8% of them using mrab's regex library and then only on a narrow build. Python2 has lots of other casemapping failures, though. Send me mail if you're still interested in this. – tchrist Aug 01 '11 at 15:15
  • 1
    Perhaps you should either link or concisely explain what code point vs code unit means before advising on using code points; from what I can see it's not obvious from your answer (well, it is obvious for those who know what the terms mean, but for the newbie it's just "Use Blargh instead of Bluff"). I personally found http://www.icu-project.org/docs/papers/forms_of_unicode/#h0 helpful. – unhammer Jun 04 '12 at 10:39
  • btw, the icu page also gives some nice examples of why rules are needed; the Tamil example is an eye-opener. – unhammer Jun 04 '12 at 10:50
6
  1. All of them support all Unicode code points.

  2. They have different performance characteristics - for example, UTF-8 is more compact for ASCII characters, whereas UTF-32 makes it easier to deal with the whole of Unicode including values outside the Basic Multilingual Plane (i.e. above U+FFFF). Because its width varies per character, UTF-8 makes it hard to jump to a particular character index within the binary encoding - you have to scan through. The same is true for UTF-16 unless you know that there are no non-BMP characters.

  3. It's probably easiest to look at the Wikipedia articles for UTF-8, UTF-16 and UTF-32

  4. Endianness determines (for UTF-16 and UTF-32) whether the most significant byte comes first and the least significant byte comes last, or vice versa. For example, if you want to represent U+1234 in UTF-16, that can either be { 0x12, 0x34 } or { 0x34, 0x12 }. A byte order mark indicates which endianness you're dealing with. UTF-8 doesn't have different endiannesses, but seeing a UTF-8 BOM at the start of a file is a good indicator that it is UTF-8.
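The two byte orders from point 4 are easy to observe in Python 3:

```python
# U+1234 in UTF-16, both byte orders, plus the BOM-carrying generic codec.
ch = "\u1234"
print(ch.encode("utf-16-be").hex())   # '1234': most significant byte first
print(ch.encode("utf-16-le").hex())   # '3412': least significant byte first
print(ch.encode("utf-16").hex())      # BOM first, then the character in native byte order
```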

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
3

Some good questions here and already a couple good answers. I might be able to add something useful.

  1. As said before, all three cover the full set of possible codepoints, U+0000 to U+10FFFF.

  2. Depends on the text, but here are some details that might be of interest. UTF-8 uses 1 to 4 bytes per character; UTF-16 uses 2 or 4; UTF-32 always uses 4. A useful thing to note is this: if you use UTF-8, then English text will be encoded with the vast majority of characters in one byte each, but Chinese needs 3 bytes per character. Using UTF-16, English and Chinese will both require 2. So basically UTF-8 is a win for English and UTF-16 is a win for Chinese (see the sketch after this list).

  3. The main difference is mentioned in the answer to #2 above, or as Jon Skeet says, see the Wikipedia articles.

  4. Endianness: For UTF-16 and UTF-32 this refers to the order in which the bytes appear; for example in UTF-16, the character U+1234 can be encoded either as 12 34 (big endian) or 34 12 (little endian). The BOM, or byte order mark, is interesting. Let's say you have a file encoded in UTF-16, but you don't know whether it is big or little endian, and you notice the first two bytes of the file are FE FF. If this were big-endian the character would be U+FEFF; if little endian, it would signify U+FFFE. But here's the thing: in Unicode the codepoint U+FFFE is a permanent noncharacter: there is no character there! Therefore we can tell the encoding must be big-endian. The U+FEFF character is harmless here; it is the ZERO WIDTH NO-BREAK SPACE (invisible, basically). Similarly, if the file began with FF FE we would know it is little endian.
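The English vs. Chinese per-character byte counts from point 2 can be verified in a couple of lines of Python 3 (sample words chosen arbitrarily, with the -le codec so no BOM is counted):

```python
# Bytes per character in UTF-8 vs. UTF-16 for English and Chinese samples.
for text in ("hello", "你好吗"):
    u8, u16 = text.encode("utf-8"), text.encode("utf-16-le")
    print(f"{text!r}: {len(u8) / len(text):.0f} bytes/char in UTF-8, "
          f"{len(u16) / len(text):.0f} bytes/char in UTF-16")
```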

Not sure if I added anything to the other answers, but I have found the English vs. Chinese concrete analysis useful in explaining this to others in the past.

Ray Toal
  • 86,166
  • 18
  • 182
  • 232
  • **NB: UTF-16 often takes more space than UTF-8 on Asian markup.** This is because with Asian texts presented in a markup language like XML or HTML, the ASCII-only markup tags dominate and therefore make the text take fewer bytes when rendered in UTF-8 than in UTF-16. I simply always use UTF-8 for everything. It makes life easier that way. UTF-16 has all bad qualities of both UTF-8 (variable-width encoding) and UTF-32 (requires meaningless BOM metadatum) with none of the advantages of either. The only thing worse than UTF-16 is UCS-2, which is no more of a Unicode encoding than ASCII is. – tchrist Jul 30 '11 at 14:05
  • Thanks for pointing this out, @tchrist. It's important and both points are true: When markup tags dominate UTF-8 can be smaller and UTF-16 is still variable-length and has endianness variation. The "UTF-8 for everything" heuristic is something I follow too. But in theory anyway all CJK characters (in the range U+2e80..U+9fff) do take 3 bytes in UTF-8, so for globs of CJK with no Latin, UTF-16 is smaller. FWIW. Just talkin' theory.... :) – Ray Toal Jul 30 '11 at 17:08
  • 2
    See [this answer](http://stackoverflow.com/questions/6883434/why-is-there-so-much-overhead-when-we-decide-to-use-utf-8-for-characters-outside/6884648#6884648) that mythbusts the notion that you suffer a 50–100% size increase to use UTF-8 instead of UTF-16 on Eastern text. That is on plaintext, not even on markup. – tchrist Jul 30 '11 at 17:21
  • @tchrist - upvoted and favorited. Thanks for the detailed case study in that answer. People _do_ have to be aware of the reality that real-life CJK has more whitespace, separators (and often ASCII markup from HTML, JSON, XML, etc.) so that the 3-byte vs. 2-byte comparison is theoretical only and should never be taken as an excuse to adopt UTF-16. These explanations are a great service. Thanks again for the case studies! – Ray Toal Jul 30 '11 at 17:52
2

One way of looking at it is as a trade-off between size and complexity. Going from UTF-8 to UTF-16 to UTF-32, the encodings generally increase in the number of bytes they need to encode text, but decrease in the complexity of the scheme they use to represent characters. Therefore, UTF-8 is usually small but can be complex to decode, whereas UTF-32 takes up more bytes but is easy to decode (though it is rarely used; UTF-16 is more common).

With this in mind, UTF-8 is often chosen for network transmission because of its smaller size, whereas UTF-16 is chosen where easier decoding matters more than storage size.

BOMs are intended as information at the beginning of a file that describes which encoding has been used. This information is often missing, though.

Tim Lloyd
  • 37,954
  • 10
  • 100
  • 130
  • This is mildly incorrect. BOMs do not indicate the encoding *per se*, but rather its endianness. They are an internal metadatum. It is much better to indicate the encoding through some auxiliary metadata property as occurs with HTTP, but this is fragile in case the dataset is stored without its corresponding metadata intact. – tchrist Jul 30 '11 at 15:02
  • @tchrist Agreed, you are quite right, but they are often used to infer the encoding. However, they are often missing. – Tim Lloyd Jul 30 '11 at 17:37
  • I don’t believe UTF-16 is ever chosen for the reason you state, that is, because it is “easier” to decode than UTF-8. That smells like a red herring, and one that’s been dead for long enough for its fragrance to have blossomed into putrescence. I believe UTF-16 is only ever chosen when some sort of binary compatibility with the legacy UCS-2 charset is desired. I call it *The UTF-16 Curse.* It is a very sad legacy that becomes a family curse. – tchrist Jul 30 '11 at 17:47
  • @tchrist ever, really? I agree it makes less sense these days. – Tim Lloyd Jul 30 '11 at 18:29
  • That’s my impression, yes. I may be wrong, but that is what I come away with after looking at a whole slew of programming languages. – tchrist Jul 30 '11 at 18:36
2

Joel Spolsky wrote a nice introductory article about Unicode:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Michael Klement
  • 3,376
  • 3
  • 30
  • 34
  • -1? Why? At least leave a comment. – Michael Klement Jul 31 '11 at 08:05
  • 1
    I didn't downvote, but I must say that Joel's article is not great. The only sources of truth are Tom Christiansen and the Unicode standard, everybody else is mostly wrong. Joel's article misses the point that Unicode is several orders of magnitude more than a blown-up version of ASCII, and the pieces it describes are the trivial ones that make up one millionth or so of the Unicode standard (but everyone except Perl even gets those trivial pieces wrong). – Philipp Jul 31 '11 at 12:18
  • @Philipp: Okay, thank you for the explanation. Obviously, I'm no Unicode expert, so I won't argue with that :) – Michael Klement Jul 31 '11 at 15:40