Could anyone give me concise definitions of

  • Unicode
  • UTF7
  • UTF8
  • UTF16
  • UTF32
  • Codepages
  • How they differ from ASCII/ANSI/Windows-1252

I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.

Arec Barrwin
  • 61,343
  • 9
  • 29
  • 25

7 Answers


This is a good start: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Aaron Digulla
  • 321,842
  • 108
  • 597
  • 820
Tim
  • 20,184
  • 24
  • 117
  • 214
  • The only caveat is that some of the information is out of date (Unicode being a moving target), although nothing that the questioner really needs to care about for his level of interest – Kathy Van Stone Sep 21 '09 at 15:04
  • Actually, Joel's oft-referenced article was not correct even at the date when it was published (2003). Correct UTF-8 does not go up to 6 bytes (only 4), there is such a thing as "plain text" (it has nothing to do with the encoding), UCS is not Unicode lingo (it is ISO lingo), and wchar_t and L"Hello" are not necessarily Unicode. But hey, he knows more than others, even if some of it is wrong. The message is still the correct one :-) – Mihai Nita Nov 12 '09 at 07:49
  • @Mihai: ① UTF-8 can go up to 6 bytes per character, but currently only up to 4 are needed. Joel's table is quite clear on that. ② Which thing is “plain text”? A plain ASCII text? ③ no objections about UCS and lingo source ④ (nitpicking :) `wchar_t` and `L"Hello"` are **always** Unicode, obviously encoded. Of course, even `"Mihai Nita"` is Unicode, encoded as `ASCII` or `ISO8859-1` or `CP1252` or even `CP1253` or … – tzot Mar 08 '11 at 21:00
  • @tzot "UTF-8 can go up to 6 bytes per character": such an encoding is not conformant to UTF-8 as defined, and that is the case since Unicode 3 (http://unicode.org/faq/utf_bom.html#utf8-4). – Mihai Nita Apr 18 '12 at 22:10
  • @tzot "wchar_t and L"Hello" are always Unicode" : not according to the C/C++ standards. "The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers." http://en.wikipedia.org/wiki/Wide_character#C.2FC.2B.2B – Mihai Nita Apr 18 '12 at 22:10
  • @tzot "even "Mihai Nita" is Unicode, encoded as ASCII or ISO8859-1 or CP1252 or even CP1253". Incorrect. When something is encoded as CP1252 is not Unicode anymore. Any character can be represented as Unicode, some can be represented as CP1252. Both Unicode and cp1252 are ways to assign numbers to characters. Might overlap or not (for instance the Euro sign is in 80h in cp1252 but U+20AC in Unicode). It is a bit like the number 12 represented in various bases: 0x0C hex, 12 dec, 014 octal. 12 is a number and it's an abstraction. You can't say 12 is decimal, encoded as hex or octal. – Mihai Nita Apr 18 '12 at 22:19
  • @MihaiNita “Any character can be represented as Unicode” No, any character is either included in some version of Unicode and afterwards, or it isn't. You talk as if you believe that Unicode means the UCS-2 encoding, just like Microsoft implies in its operating systems. Yes, “both Unicode and CP1252 are ways to assign numbers to characters”, but **only** CP1252 is a way to encode characters to bytes. Unicode and UCS-4/UTF-32 are **not** the same thing. – tzot Apr 19 '12 at 06:58
  • http://www.utf8everywhere.org is a great sequel to Joel's article, about what you should really be doing in your application. – Pavel Radzivilovsky Sep 08 '12 at 23:15

If you want a really brief introduction: Unicode in 5 Minutes

Or if you are after one-liners:

  • Unicode: a mapping of characters to integers ("code points") in the range 0 through 1,114,111; covers pretty much all written languages in use
  • UTF7: an encoding of code points into a byte stream with the high bit clear; in general do not use
  • UTF8: an encoding of code points into a byte stream where each character may take one, two, three or four bytes to represent; should be your primary choice of encoding
  • UTF16: an encoding of code points into a word stream (16-bit units) where each character may take one or two words (two or four bytes) to represent
  • UTF32: an encoding of code points into a stream of 32-bit units where each character takes exactly one unit (four bytes); sometimes used for internal representation (a byte-count comparison of the three encoding forms is sketched just after this list)
  • Codepages: a system in DOS and Windows whereby characters are assigned to integers, together with an associated encoding; each covers only a subset of languages. Note that these assignments are generally different from the Unicode assignments
  • ASCII: a very common assignment of characters to integers, and the direct encoding into bytes (all high bit clear); the assignment is a subset of Unicode, and the encoding a subset of UTF-8
  • ANSI: a standards body; in Windows terminology it is also used, loosely, to mean the system's default codepage, such as Windows-1252
  • Windows 1252: A commonly used codepage; it is similar to ISO-8859-1, or Latin-1, but not the same, and the two are often confused
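
To make these one-liners concrete, here is a minimal Java sketch (illustrative only; the class name is made up, and the `UTF-32BE` charset, while present in common JDKs, is not guaranteed by the Java specification) that encodes a few characters with each encoding form and prints how many bytes each needs:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        // "A" is ASCII, "\u20AC" is the euro sign (U+20AC), and "\uD834\uDD1E" is
        // U+1D11E (musical G clef), which lies outside the Basic Multilingual Plane.
        String[] samples = { "A", "\u20AC", "\uD834\uDD1E" };

        for (String s : samples) {
            int utf8  = s.getBytes(StandardCharsets.UTF_8).length;
            int utf16 = s.getBytes(StandardCharsets.UTF_16BE).length;   // BE variant: no BOM added
            int utf32 = s.getBytes(Charset.forName("UTF-32BE")).length; // not guaranteed by the spec
            System.out.printf("U+%04X -> UTF-8: %d bytes, UTF-16: %d bytes, UTF-32: %d bytes%n",
                    s.codePointAt(0), utf8, utf16, utf32);
        }
    }
}
```

Running it shows 1/2/4 bytes for "A", 3/2/4 for the euro sign and 4/4/4 for the G clef, which is the whole trade-off between the three encoding forms in miniature.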

Why do you care? Because without knowing the character set and encoding in use, you don't really know what characters a given byte stream represents. For example, the byte 0xDE could encode

  • Þ (LATIN CAPITAL LETTER THORN)
  • fi (LATIN SMALL LIGATURE FI)
  • ή (GREEK SMALL LETTER ETA WITH TONOS)
  • or 13 other characters, depending on the encoding and character set used (see the sketch just below this list).
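
Here is a hedged sketch of exactly that ambiguity in Java (illustrative only; it should print Þ, ή and the fi ligature; ISO-8859-7 and x-MacRoman ship with common JDKs but, unlike ISO-8859-1, are not guaranteed to be present on every runtime):

```java
import java.nio.charset.Charset;

public class ByteMeanings {
    public static void main(String[] args) {
        byte[] data = { (byte) 0xDE };  // one and the same byte...

        // ...decoded under three different charsets. Charset.forName throws
        // UnsupportedCharsetException if a name is unknown to the runtime.
        for (String name : new String[] { "ISO-8859-1", "ISO-8859-7", "x-MacRoman" }) {
            System.out.println(name + " -> " + new String(data, Charset.forName(name)));
        }
    }
}
```
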
tzot
  • 92,761
  • 29
  • 141
  • 204
MtnViewMark
  • 5,120
  • 2
  • 20
  • 29

As well as the oft-referenced Joel one, I have my own article which looks at it from a .NET-centric viewpoint, just for variety...

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194

Yeah, I've picked up some insight here; it might not be entirely accurate, but it's helped me to understand it.

Let's just take some text. It's stored in the computer's RAM as a series of bytes, and the codepage is simply the mapping table between those bytes and the characters you and I read. Something like Notepad comes along with its codepage, translates the bytes onto your screen, and you see a bunch of garbage, upside-down question marks and so on. This does not mean your data is garbled, only that the application reading the bytes is not using the correct codepage. Some applications are smarter than others at detecting the correct codepage to use, and some byte streams carry a BOM (Byte Order Mark), which can declare the correct encoding to use.
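
As a rough sketch of what BOM-based detection looks like, here is some illustrative Java (the `sniffBom` helper is invented for this example and is nowhere near a robust detector; real-world code tends to lean on a library such as ICU's CharsetDetector):

```java
public class BomSniffer {
    // Best-guess encoding based on a leading byte order mark, or null if no BOM is present.
    static String sniffBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";            // EF BB BF
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";         // FE FF
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";         // FF FE (also the start of a UTF-32LE BOM)
        }
        return null;                   // no BOM: the encoding must be known or guessed another way
    }

    public static void main(String[] args) {
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i' };
        System.out.println(sniffBom(withBom));                  // UTF-8
        System.out.println(sniffBom(new byte[] { 'h', 'i' }));  // null, no BOM present
    }
}
```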

UTF-7, UTF-8, UTF-16 and so on are all just different codepages (encodings) that use different formats.

The same text stored with different codepages will give files of different sizes, because the bytes are laid out differently.

They also don't really differ from Windows-1252 in kind, as that's just another codepage.

For a better, smarter answer, try one of the links.

Robert
  • 1,835
  • 4
  • 25
  • 30

Here, read this wonderful explanation from Joel himself.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)


Others have already pointed out good references to begin with. I'm not listing a true dummies' guide, but rather some pointers from the Unicode Consortium's own pages, where you'll find the more nitty-gritty reasons for the use of the different encodings.

The Unicode FAQ is a good enough place to answer some (not all) of your queries.

A more succinct answer on why Unicode exists is given in the Newcomer's section of the Unicode website itself:

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

As far as the technical reasons for usage of UTF-8, UTF-16 or UTF-32 are concerned, the answer lies in the Technical Introduction to Unicode:

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32 is popular where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (or 32-bits) of data for each character.

A general rule of thumb is to use UTF-8 when the predominant languages supported by your application are spoken west of the Indus river, UTF-16 for the opposite (east of the Indus), and UTF-32 when you want uniform, fixed-width storage for every character.
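
A small sketch of the storage trade-off behind that rule of thumb (Java, illustrative only; the sample strings and class name are invented): ASCII-heavy text is half the size in UTF-8, while text in a script whose characters each need three UTF-8 bytes comes out smaller in UTF-16.

```java
import java.nio.charset.StandardCharsets;

public class SizeTradeoff {
    public static void main(String[] args) {
        String english = "Unicode assigns a number to every character.";
        String hindi   = "\u092F\u0942\u0928\u093F\u0915\u094B\u0921"; // "यूनिकोड" (Unicode), 7 code points

        report("English", english);  // UTF-8 wins: one byte per character vs. two
        report("Hindi", hindi);      // UTF-16 wins: 14 bytes vs. 21 in UTF-8
    }

    static void report(String label, String text) {
        System.out.printf("%-8s UTF-8: %3d bytes, UTF-16: %3d bytes%n",
                label,
                text.getBytes(StandardCharsets.UTF_8).length,
                text.getBytes(StandardCharsets.UTF_16BE).length);
    }
}
```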

By the way, UTF-7 is not part of the Unicode Standard; it was designed primarily for use in mail applications.

Vineet Reynolds
  • 76,006
  • 17
  • 150
  • 174
  • Note that if the text in your application is stored with mark-up (HTML, XML or other similar), then often UTF-8 is more efficient even for Asian languages. For example, when dealing with the web, choosing to use UTF-8 uniformly throughout your workflow is totally reasonable. – MtnViewMark Sep 22 '09 at 17:19
  • Yes, I agree with that notion for dealing with the web. However, for thick clients programmed in C/C++ etc., UTF-16 usually makes sense for an Asian language market. – Vineet Reynolds Sep 22 '09 at 17:33

> I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.

First of all, there aren't "variations of Unicode". Unicode is a standard, the standard, for assigning code points (integers) to characters. UTF8 is the most popular way to represent those integers as bytes!
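
To make that concrete, here is a tiny Java sketch (illustrative only; the class name is made up) that prints each character's code point together with the bytes UTF8 uses to represent it:

```java
import java.nio.charset.StandardCharsets;

public class CodePointsAndBytes {
    public static void main(String[] args) {
        // Three characters: ASCII 'a' (U+0061), 'ä' (U+00E4) and the euro sign (U+20AC).
        String text = "a\u00E4\u20AC";

        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp)); // back to a one-character string
            StringBuilder hex = new StringBuilder();
            for (byte b : ch.getBytes(StandardCharsets.UTF_8)) {
                hex.append(String.format("%02X ", b));
            }
            System.out.printf("U+%04X -> UTF-8 bytes: %s%n", cp, hex.toString().trim());
        });
    }
}
```

It prints one byte (61) for the ASCII character, two (C3 A4) for ä and three (E2 82 AC) for the euro sign, which is the "variable length" part of UTF8 in action.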

Why should you care as a programmer?

  • It's fun to understand this!
    If you don't have a basic understanding of encodings, you can easily produce buggy code.

Example: You receive a ByteArray myByteArray from somewhere and you know it represents characters. You then run myByteArray.toString() and you get the string Hello. Your program works! One day after shipping your code, your German customer calls: "We have a problem, äöü are not displayed correctly!". You start debugging the code, feeling pretty lost without a basic understanding of encodings. With that understanding, however, you know the error was probably this: when running myByteArray.toString(), your program assumed the bytes were encoded with the default system encoding. But maybe they weren't! Maybe they were UTF8 and your system is LATIN-SOMETHING, so you should have run myByteArray.toString("UTF8") instead!
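
A minimal sketch of that scenario in Java (the bytes are invented for illustration, and Java's idiom is new String(bytes, charset) rather than the pseudocode toString above): the very same byte array decodes to garbage or to the intended text depending on which charset the decoder is told to use.

```java
import java.nio.charset.StandardCharsets;

public class DecodingBug {
    public static void main(String[] args) {
        // Pretend these bytes arrived from "somewhere": they are "äöü" encoded as UTF-8.
        byte[] myByteArray = "\u00E4\u00F6\u00FC".getBytes(StandardCharsets.UTF_8);

        // Decoding with the wrong charset does not throw, it just produces mojibake.
        String wrong = new String(myByteArray, StandardCharsets.ISO_8859_1);
        String right = new String(myByteArray, StandardCharsets.UTF_8);

        System.out.println(wrong); // Ã¤Ã¶Ã¼  (garbage, the customer's bug)
        System.out.println(right); // äöü    (what was intended)
    }
}
```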

Resources:

I would NOT recommend Joel's article as suggested by others. It's a long article with a lot of irrelevant information. I read it a couple of years back, and the essence of it didn't stick in my brain because there are so many unimportant details.

As already mentioned, http://wiki.secondlife.com/wiki/Unicode_In_5_Minutes is a great place to go to grasp the essence of Unicode.

If you want to actually understand variable length encodings like UTF8 I'd recommend https://www.tsmean.com/articles/encoding/unicode-and-utf-8-tutorial-for-dummies/.

bersling
  • 17,851
  • 9
  • 60
  • 74