
I'm working on internationalizing one of my programs for work. I'm trying to use foresight to avoid possible issues or redoing the process down the road.

I see references for UTF-8, UTF-16 and UTF-32. My question is two parts:

  1. What languages does UTF-8 not support?
  2. What advantages do UTF-16 and UTF-32 have over UTF-8?

If UTF-8 works for everything, then I'm curious what the advantages of UTF-16 and UTF-32 are (e.g. special search features in a database, etc.). Having that understanding should help me finish designing my program (and database connections) properly. Thanks!

James Oravec

2 Answers


All three are just different ways to represent the same thing, so there are no languages supported by one and not another.
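As a rough illustration of that point, the sketch below round-trips one mixed-script string through each encoding using Python's built-in codecs. The helper name `round_trip` and the sample string are made up for this example.

```python
ENCODINGS = ("utf-8", "utf-16", "utf-32")

def round_trip(text):
    """Return, per encoding, whether encode followed by decode gives back the text."""
    return {enc: text.encode(enc).decode(enc) == text for enc in ENCODINGS}

# Hypothetical sample mixing Latin, CJK and an emoji outside the BMP.
sample = "Grüße, 世界 🌍"
results = round_trip(sample)
print(results)
```

Every entry should come back `True`: the encodings differ in byte layout, not in which codepoints they can carry.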

Sometimes UTF-16 is used by a system that you need to interoperate with - for instance, the Windows API uses UTF-16 natively.

In theory, UTF-32 can represent any "character" in a single 32-bit integer, whereas UTF-8 and UTF-16 may need more than one 8-bit or 16-bit integer to do that. But in practice, because some characters can be written either as a single precomposed codepoint or as a base codepoint followed by combining codepoints, that's not really true.
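A small sketch of the combining-codepoint caveat, using Python's standard `unicodedata` module. The helper `utf32_units` is a name invented for this example; it just counts 32-bit units.

```python
import unicodedata

def utf32_units(s):
    """Number of 32-bit code units needed to store s in UTF-32."""
    return len(s.encode("utf-32-le")) // 4

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # precomposed U+00E9

print(utf32_units(decomposed))  # two units for what a reader sees as one character
print(utf32_units(composed))    # one unit for the same visible character
```

Both strings render as "é", yet the decomposed form takes two 32-bit units, so even UTF-32 does not guarantee one unit per user-visible character.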

One advantage of UTF-8 over the others is that if you have a bug whereby you're assuming that the number of 8-, 16- or 32-bit integers respectively is the same as the number of codepoints, it becomes obvious more quickly with UTF-8 - something will fail as soon as you have any non-ASCII codepoint in there, whereas with UTF-16 the bug can go unnoticed.
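To make that concrete, here is a minimal sketch comparing the codepoint count with the code-unit count in each encoding. `unit_counts` is a hypothetical helper written for this illustration.

```python
def code_units(text, encoding, unit_size):
    """Count code units: encoded byte length divided by the unit width in bytes."""
    return len(text.encode(encoding)) // unit_size

def unit_counts(text):
    return {
        "codepoints": len(text),
        "utf-8": code_units(text, "utf-8", 1),
        "utf-16": code_units(text, "utf-16-le", 2),
        "utf-32": code_units(text, "utf-32-le", 4),
    }

print(unit_counts("caf\u00e9"))    # one non-ASCII codepoint inside the BMP
print(unit_counts("\U0001F600"))   # one codepoint outside the BMP (an emoji)
```

For "café" the UTF-8 count already differs from the codepoint count (5 versus 4), while the UTF-16 count still matches. The UTF-16 mismatch only shows up with the emoji, which needs a surrogate pair (2 units for 1 codepoint).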

To answer your first question, here's a list of scripts currently unsupported by Unicode: http://www.unicode.org/standard/unsupported.html

RichieHindle
  • Do you know if there are any database advantages of using one type over the other? – James Oravec Mar 27 '13 at 16:31
  • UTF-8 is more compact for predominantly-English text, so things are likely to be faster with that. You shouldn't find any database features that are available with one encoding and not another. – RichieHindle Mar 27 '13 at 16:37

UTF-8 uses a variable 1 to 4 bytes per character, UTF-16 uses 2 or 4 bytes, and UTF-32 uses a fixed 4 bytes.

That is why UTF-8 has an advantage where ASCII characters are the most prevalent, UTF-16 is better where ASCII is not predominant (it stores many non-ASCII characters in 2 bytes where UTF-8 may need 3), and UTF-32 covers every possible character in exactly 4 bytes.
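The size trade-off can be checked with a short sketch like the one below. `encoded_sizes` is a helper name made up for this example, and the `-le` codec variants are used so that no byte-order mark is added to the counts.

```python
def encoded_sizes(text):
    """Byte length of text in each encoding, without a byte-order mark."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

print(encoded_sizes("hello"))        # ASCII only
print(encoded_sizes("日本語"))        # three CJK characters in the BMP
print(encoded_sizes("\U0001F600"))   # one emoji outside the BMP
```

ASCII text comes out smallest in UTF-8 (5 bytes versus 10 and 20), the CJK sample is smallest in UTF-16 (6 bytes versus 9 in UTF-8 and 12 in UTF-32), and the emoji takes 4 bytes in all three.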

Leo Chapiro
  • Most of our sales will be from the US, so that will be our primary focus, i.e. I want speed. So based on this, I assume you agree UTF-8 would be the best choice for me? – James Oravec Mar 27 '13 at 16:25
  • So UTF-8 never takes more space than UTF-16, and UTF-16 never takes more space than UTF-32; furthermore UTF-8 is faster and usually less spacious with mainly ASCII-style strings than the other two as a whole, though the other two as a whole are faster than UTF-8 when dealing with mainly non-ASCII-style strings. Is this correct? What's the tradeoff between UTF-16 and UTF-32? – Panzercrisis Dec 11 '14 at 19:32