
I have to distribute my app internationally.

Let's say I have a control (like a memo) where the user enters some text. The user can be Japanese, Russian, Canadian, etc. I want to save the string to disk as a TXT file for later use. I will use MY OWN function to write the text and not something like TMemo.SaveToFile().

Should I save the string to disk in UTF8 or in UTF16 format?

Gabriel
    You're saving a string by itself? I would say that the answer depends on the context. Do users have a file format they expect? RTF? HTML? XML? I don't think that performance or memory usage or disk usage is going to dictate the issue, I think user expectations and user experience (does it Just Work) is going to require that you find that kind of thing out directly from your users. And I doubt they care. They just want it to work. – Warren P Mar 22 '12 at 13:23
    Worth a read: http://utf8everywhere.org/ – Arnaud Bouchez Dec 19 '16 at 08:19

3 Answers


The main difference between them is that UTF8 is backwards compatible with ASCII. As long as you only use the first 128 characters, an application that is not Unicode-aware can still process the data (which may be an advantage or a disadvantage, depending on your scenario). In particular, when switching to UTF16 every API function needs to be adjusted for 16-bit strings, while with UTF8 you can often leave old API functions untouched if they don't do any string processing. Also, UTF8 does not depend on endianness, while UTF16 does, which may complicate string I/O.
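A quick check of the ASCII-compatibility point (sketched in Python, since the byte-level behavior is language-agnostic):

```python
# ASCII text encodes to the same bytes in ASCII and UTF-8,
# so non-Unicode-aware tools can still read it.
text = "Hello, world!"
assert text.encode("ascii") == text.encode("utf-8")

# UTF-16 doubles the size even for pure ASCII, and the
# serialized form carries a BOM to signal its endianness.
print(len(text.encode("utf-8")))    # 13
print(len(text.encode("utf-16")))   # 28 (2 bytes per char + 2-byte BOM)
```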

A common misconception is that UTF16 is easier to process because each character always occupies exactly two bytes. That is, unfortunately, not true. UTF16 is a variable-length encoding where a character may take up either 2 or 4 bytes. So any difficulties associated with UTF8 regarding variable-length issues apply to UTF16 just as well.
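The variable-length nature of UTF16 is easy to demonstrate with any character outside the Basic Multilingual Plane (illustrated here in Python):

```python
# U+1F600 (a grinning-face emoji) lies outside the BMP, so UTF-16
# needs a surrogate pair: two 16-bit code units, i.e. 4 bytes.
ch = "\U0001F600"
utf16 = ch.encode("utf-16-le")   # little-endian, no BOM
print(len(utf16))                # 4 -> two code units for one character
print(utf16.hex())               # 3dd800de -> surrogates D83D, DE00
```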

Finally, storage sizes: Another common myth about UTF16 is that it is more storage-efficient than UTF8 for most foreign languages. UTF8 takes less storage for all European languages, which can be encoded with one or two bytes per character. Non-BMP characters take up 4 bytes in both UTF8 and UTF16. The only case in which UTF16 takes less storage is if your text mainly consists of characters from the range U+0800 through U+FFFF, which need three bytes in UTF8 but only two in UTF16; this is where the characters for Chinese, Japanese and Hindi are stored.
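The storage trade-off can be measured directly (a small Python sketch; the sample words are arbitrary):

```python
# Bytes per string in UTF-8 vs UTF-16 (LE, no BOM) for three scripts.
samples = {
    "English":  "hello",        # 1 byte/char in UTF-8
    "Russian":  "привет",       # Cyrillic: 2 bytes/char in UTF-8
    "Japanese": "こんにちは",    # U+3000+ range: 3 bytes/char in UTF-8
}
for name, s in samples.items():
    print(name, len(s.encode("utf-8")), len(s.encode("utf-16-le")))
# English:   5 vs 10 -> UTF-8 wins
# Russian:  12 vs 12 -> tie
# Japanese: 15 vs 10 -> UTF-16 wins
```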

James McNellis gave an excellent talk at BoostCon 2014, discussing the various trade-offs between different encodings in great detail. Even though the talk is titled Unicode in C++, the entire first half is actually language-agnostic. A video recording of the full talk is available on BoostCon's YouTube channel, while the slides can be found on GitHub.

ComicSansMS

Depends on the language of your data.

If your data is mostly in Western languages and you want to reduce the amount of storage needed, go with UTF-8, as for those languages it will take about half the storage of UTF-16. You will pay a penalty when reading the data, as it needs to be converted to UTF-16, which is the Windows default and is used by Delphi's (Unicode) string.

If your data is mostly in non-Western languages, UTF-8 can take more storage than UTF-16, as it may take up to 4 bytes per character for some (see comment by @KennyTM).

Basically: do some tests with representative samples of your users' data and see which performs better, both in storage requirements and load times. We have had some surprises with UTF-16 being slower than we thought. The performance gain of not having to transform from UTF-8 to UTF-16 was lost to extra disk access, as the data volume in UTF-16 is greater.
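The kind of measurement suggested above can be sketched like this (Python used for illustration; `measure` is a hypothetical helper name, and real tests should use your actual user data):

```python
import os, tempfile, time

def measure(text: str, encoding: str) -> tuple[int, float]:
    """Write text in the given encoding, then time a full read-back.
    Returns (file size in bytes, read time in seconds)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)
        start = time.perf_counter()
        with open(path, "r", encoding=encoding, newline="") as f:
            f.read()
        return os.path.getsize(path), time.perf_counter() - start
    finally:
        os.remove(path)

sample = "The quick brown fox. " * 100_000   # substitute representative data
for enc in ("utf-8", "utf-16"):
    size, secs = measure(sample, enc)
    print(f"{enc}: {size} bytes, {secs:.4f}s")
```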

Marjan Venema
Indeed the UTF-8 to UTF-16 conversion overhead will almost always be negligible compared to the extra I/O overhead though, even when the data is stored on SSD. – Eric Grange Mar 22 '12 at 10:33
    UTF-8 can at most take 4 bytes. Surrogate pairs in UTF-16 should not be converted to UTF-8 independently. – kennytm Mar 22 '12 at 13:42
  • @KennyTM: Can you provide a link to any resources with more information on that? It is contrary to the fact that the UTF-8 encoding allows for up to 6 bytes. So I would like to learn more about this. – Marjan Venema Mar 22 '12 at 19:36
@MarjanVenema: Tables 3-6 and 3-7 in http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf. The encoding allowing up to 6 bytes does not mean such a sequence is well-formed, as Unicode's maximum code point is 0x10FFFF. – kennytm Mar 22 '12 at 19:56

First of all, be aware that the standard encoding under Windows is UCS2 (until Windows 2000) or UTF-16 (since XP), and that Delphi's native "string" type has used the same native format since Delphi 2009 (string=UnicodeString, char=WideChar).

In all cases, it is unsafe to assume 1 WideChar == 1 Unicode character - this is the surrogate problem.
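The surrogate problem in a nutshell (illustrated in Python; the code-unit count shown is what Length() reports for a Delphi UnicodeString, since both count 16-bit units):

```python
# U+1D11E (musical G clef) is a single character, but it sits above
# the BMP, so UTF-16 stores it as two 16-bit code units -- the
# equivalent of two WideChars in a Delphi string.
clef = "\U0001D11E"
code_units = len(clef.encode("utf-16-le")) // 2
print(code_units)   # 2 -> one character, two WideChars
```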

About UTF-8 or UTF-16 choice, it depends on the storage itself:

  • If your file is a plain text file (including XML) you may use either UTF-8 or UTF-16 - but you will have to put a BOM at the beginning of the file, otherwise applications (like Notepad) may be confused when opening it - for XML this is handled by your library (if it is not, change to another library);
  • If you are sure that your content is mostly 7 bit ASCII, use UTF-8 and the associated BOM;
  • If your file is some kind of database or a custom binary format, certainly the best format is UTF-16/UCS2, i.e. the default Delphi 2009+ string layout, and certainly the default database API layout;
  • Some file formats require or prefer UTF-8 (like JSON or even SQLite3), even if UTF-8 files can be bigger than UTF-16 for Asian characters.
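Writing the BOM ahead of the payload, as the first bullet suggests, looks like this when done by hand (sketched in Python; a Delphi implementation would write the same bytes via a TFileStream):

```python
import codecs, os, tempfile

text = "héllo"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8)        # EF BB BF marks the file as UTF-8
    f.write(text.encode("utf-8"))

with open(path, "rb") as f:
    raw = f.read()
print(raw[:3].hex())                # efbbbf -> readers like Notepad detect UTF-8
os.remove(path)
```

Python's "utf-8-sig" codec emits the same BOM automatically; the point is that the marker is just three fixed bytes prepended to the encoded text.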

For instance, we used UTF-8 for our Client-Server framework, since we use JSON as exchange format (which requires UTF-8), and since SQLite3 likes UTF-8. Of course, we had to write some dedicated functions and classes, to avoid conversion to/from string (which is slow for the string=UnicodeString type since Delphi 2009, and may lose some data when used with the string=AnsiString type before Delphi 2009. See this post and this unit). The easiest approach is to rely on the string=UnicodeString type, use the RTL functions which handle UTF-16 directly, and avoid conversions. And do not forget about your previous question.

If disk space or read/write speed is a problem, consider using compression instead of changing the encoding. There are real-time compression algorithms around (faster than ZIP), like LZO or our SynLZ.
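Compression indeed tends to shrink the UTF-8 vs UTF-16 size gap, since both encodings carry the same redundancy. A rough illustration with Python's standard-library zlib as a stand-in for LZO/SynLZ (the sample text is arbitrary and deliberately repetitive):

```python
import zlib

text = "こんにちは、世界。" * 10_000
u8, u16 = text.encode("utf-8"), text.encode("utf-16-le")
print(len(u8), len(u16))        # raw: 270000 vs 180000, UTF-8 larger for CJK
print(len(zlib.compress(u8)), len(zlib.compress(u16)))
# compressed: both shrink dramatically, and the raw-size gap narrows
```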

Arnaud Bouchez
    Windows switched to UTF-16 in Windows 2000, not XP. – Remy Lebeau Mar 22 '12 at 19:32
    The Unicode standard recommends against the BOM for [UTF-8](http://en.wikipedia.org/wiki/UTF-8) – mjn Mar 24 '12 at 21:51
  • @mjn You are right, my remark came from Windows-world practice, which is not the official standard. But it is also both faster and easier to search for a BOM than to scan the whole content to check whether it is valid UTF-8 (or not). For instance, IMHO there is no easy way in the Delphi RTL to check for UTF-8 validity. This is a contentious subject - see [this SO question](http://stackoverflow.com/questions/4907942/detecting-text-file-type-ansi-vs-utf-8) - just as any time Windows does not follow a recommendation... – Arnaud Bouchez Mar 27 '12 at 12:12
  • @RemyLebeau I'm not sure plain Windows 2000 (without Service Pack) did handle surrogates and the whole UTF-8 encoding - see http://blogs.msdn.com/b/michkap/archive/2005/05/11/416552.aspx But such plain Windows 2000 is deprecated anyway. Even the reference on Wikipedia about this point is dubious (related to SQL Server and UTF-8). – Arnaud Bouchez Mar 27 '12 at 12:16