
There's tons of info about Unicode code units, code points, etc., but I'm still a bit fuzzy about how combining characters, graphemes, etc. are converted when working with byte streams (as required by libiconv).

Currently I'm only interested in converting between UTF-8/UTF-16/UTF-32 using libiconv's iconv(), which expects the byte lengths of both the source and destination buffers as arguments.

Question: Is there a safe way to quickly calculate the maximum possible byte length of the target buffer, based on the already-known byte length of the source buffer?

Let's say, for example, we're converting from u16buf to u8buf with a known u16byteslen (excluding the 0x0000 terminator, if any). In the worst-case scenario, there will be one two-byte unit per code point in the UTF-16 source buffer, corresponding to 4 single-byte units per code point in the UTF-8 target buffer. Is that enough to safely assume that the UTF-8 target buffer can never be longer than 2 * u16byteslen?

I've actually experimented with that and it seems to work, but I'm not sure if I'm missing corner cases involving combining characters and grapheme clusters. My doubts come from my ignorance of how those things are converted across these 3 encodings. I mean, is it possible for a grapheme to need, say, 3 UTF-16 code points but something like 10 UTF-8 code points when converted?

In that case, doubling u16byteslen wouldn't suffice, right? And if so, is there any other straightforward way to pre-calculate the maximum length of the target buffer?
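
For reference, here's roughly what my experiment looks like (a simplified sketch; the function and buffer names are just placeholders, and the 2 * u16byteslen sizing is exactly the assumption I'm asking about):

```c
#include <iconv.h>
#include <stdlib.h>
#include <uchar.h>

/* Sketch: convert a UTF-16LE buffer to UTF-8, sizing the target buffer
 * as 2 * u16byteslen -- the assumption this question is about. */
char *u16_to_u8(const char16_t *u16buf, size_t u16byteslen)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t u8byteslen = 2 * u16byteslen;    /* is this always enough? */
    char *u8buf = malloc(u8byteslen + 1);
    if (!u8buf) {
        iconv_close(cd);
        return NULL;
    }

    char *in  = (char *)u16buf;
    char *out = u8buf;
    size_t inleft = u16byteslen, outleft = u8byteslen;

    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
        free(u8buf);
        u8buf = NULL;
    } else {
        *out = '\0';
    }
    iconv_close(cd);
    return u8buf;
}
```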

Harry K.
  • Converting from one UTF scheme to another shouldn't change the codepoints at all. – Mark Ransom Jun 04 '21 at 14:13
  • @MarkRansom: Using `iconv`, converting to `UTF-16` or `UTF-32` will add a U+FEFF byte order mark to the beginning. Surrogate code points also often won't survive re-encoding. – Dietrich Epp Jun 04 '21 at 14:17
  • @DietrichEpp didn't know it would add a BOM, thanks. And surrogate points are only possible in UCS2 which wasn't part of the question, right? – Mark Ransom Jun 04 '21 at 14:31
  • @MarkRansom: Surrogate points are possible in a sequence of "code points", but not in a sequence of "Unicode scalar values". Just nitpicking terminology. – Dietrich Epp Jun 04 '21 at 14:44
  • @DietrichEpp "*Using `iconv`, converting to `UTF-16` or `UTF-32` will add a U+FEFF byte order mark to the beginning*" - only if you ask it to output a BOM. It is possible to convert without outputting a BOM. – Remy Lebeau Jun 04 '21 at 17:23
  • @RemyLebeau: That's backwards--`iconv` will add a BOM unless you specifically ask it not to by specifying e.g. `UTF-16LE` or `UTF-16BE` – Dietrich Epp Jun 05 '21 at 01:23
  • @DietrichEpp exactly. If you ask it to output to a general-purpose `UTF-16` or `UTF-32`, you are asking it to output a BOM to specify which it picked. If you ask it to output to specifically `UTF-16LE/BE` or `UTF-32LE/BE`, you are asking it not to output a BOM. – Remy Lebeau Jun 05 '21 at 01:28
  • Right, which is why I put `UTF-16` code tick marks, because it's an argument to iconv. "Converting to `UTF-16` will add a U+FEFF byte order mark." – Dietrich Epp Jun 05 '21 at 01:51

2 Answers


Question: Is there a safe way to quickly calculate the maximum possible byte length of the target buffer, based on the already-known byte length of the source buffer?

Yes.

|             | to UTF-8 | to UTF-16 | to UTF-32 |
|-------------|----------|-----------|-----------|
| from UTF-8  |          | ×2        | ×4        |
| from UTF-16 | ×1½      |           | ×2        |
| from UTF-32 | ×1       | ×1        |           |

You can calculate this yourself by breaking it down by code-point ranges. Pick a source and destination column, and find the largest ratio.

| Code Point   | UTF-8 length | UTF-16 length | UTF-32 length |
|--------------|--------------|---------------|---------------|
| 0000…007F    | 1            | 2             | 4             |
| 0080…07FF    | 2            | 2             | 4             |
| 0800…FFFF    | 3            | 2             | 4             |
| 10000…10FFFF | 4            | 4             | 4             |
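
For instance, a small helper wrapping those worst-case ratios might look like this (just a sketch, not part of iconv; the function name and unit-size parameters are made up, and the BOM discussed below is not included):

```c
#include <stddef.h>

/* Worst-case output size in bytes for transcoding between UTF encodings,
 * using the ratio table above. src_unit/dst_unit are 8, 16 or 32.
 * UTF-16 byte lengths are always even, so the *3/2 case is exact. */
size_t utf_worst_case_bytes(size_t src_bytes, int src_unit, int dst_unit)
{
    if (src_unit == 8  && dst_unit == 16) return src_bytes * 2;      /* ×2  */
    if (src_unit == 8  && dst_unit == 32) return src_bytes * 4;      /* ×4  */
    if (src_unit == 16 && dst_unit == 8)  return src_bytes * 3 / 2;  /* ×1½ */
    if (src_unit == 16 && dst_unit == 32) return src_bytes * 2;      /* ×2  */
    return src_bytes;                        /* from UTF-32, or same encoding: ×1 */
}
```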

Combining characters and grapheme clusters do not affect anything. Encodings simply convert a sequence of Unicode scalar values to bytes, and they are very straightforward.

Note that you will need to add two extra bytes when converting to UTF-16, and four extra bytes when converting to UTF-32, since these encodings will add a BOM U+FEFF to the beginning of the text. (If you don’t want that, use one of the BOM-less encodings, like UTF-16BE or UTF-16LE.)
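
For example, with iconv the choice looks like this (a sketch of the iconv_open calls; error handling omitted):

```c
#include <iconv.h>

void open_examples(void)
{
    /* "UTF-16" lets iconv choose a byte order and prepend a U+FEFF BOM
     * (2 extra output bytes); "UTF-16LE"/"UTF-16BE" fix the byte order
     * and emit no BOM. The same applies to UTF-32 (4 extra bytes). */
    iconv_t with_bom    = iconv_open("UTF-16",   "UTF-8");
    iconv_t without_bom = iconv_open("UTF-16LE", "UTF-8");

    iconv_close(with_bom);
    iconv_close(without_bom);
}
```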

I mean, is it possible for a grapheme to need, say, 3 UTF-16 code points but something like 10 UTF-8 code points when converted?

No. That would imply some other kind of conversion, like a decomposition. The number of scalar values input is equal to the number of scalar values output, with the possible addition of a U+FEFF byte order mark at the beginning. (I say "scalar value" instead of "code point" because "scalar value" excludes surrogates. If you are transcoding text which might have errors or might be garbage data, that doesn't change the size of the result.)
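
To make that concrete: the grapheme "é" written as e plus a combining acute accent is two scalar values in every encoding; only the number of bytes per value changes. A small illustration:

```c
#include <uchar.h>

/* "é" as <U+0065, U+0301> (e + combining acute accent):
 * 2 scalar values in every encoding, only the byte counts differ. */
static const char     g_utf8[]  = { 0x65, (char)0xCC, (char)0x81 };  /* 1 + 2 = 3 bytes */
static const char16_t g_utf16[] = { 0x0065, 0x0301 };                /* 2 + 2 = 4 bytes */
static const char32_t g_utf32[] = { 0x0065, 0x0301 };                /* 4 + 4 = 8 bytes */
```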

Dietrich Epp
  • Beat me to it, but this is exactly what I was going to write, except that I was going to lead with the third paragraph. iconv doesn't do anything about combining characters, normalization, etc. when going from one Unicode encoding to another; it just re-encodes on a codepoint-by-codepoint basis. – hobbs Jun 04 '21 at 14:12
  • "_these encodings will add a BOM_" - Is that mandated by the UTF standard(s) or is that just how `iconv` does it per default? – Ted Lyngmo Jun 04 '21 at 14:28
  • Thank you for a great answer, plus these tables can prove really handy as a quick reference. I didn't know about the auto BOM injection either (also good to know!) – Harry K. Jun 04 '21 at 14:38
  • @TedLyngmo: UTF-16 really does need a BOM, because UTF-16 can either be little endian or big endian. Both byte orders are permitted! Without the BOM, you have no way to reliably decode it. If you want to exclude the BOM, there are encodings such as UTF-16LE and UTF-16BE which have *fixed* byte order and don't need the BOM. These are the recommended tags for BOM-less UTF-16 text from the Unicode organization. – Dietrich Epp Jun 04 '21 at 14:48
  • @DietrichEpp Yes, I know why the BOM is there and that there are two UTF-16 versions, the recommended version and the Windows LE version that is the one most commonly used of the two. I was just wondering if the standard requires the BOM since I couldn't find it now but think I read sometime back that it isn't - and that software dealing with BOM-less texts deduces which one it is by inspecting the byte data. Sooner or later an illegal combination appears if decoded by the wrong decoder. If no illegal combinations appear at all - it would have to be deduced by inspecting plausible combinations – Ted Lyngmo Jun 04 '21 at 14:53
  • "UTF-16LE" was what I've experimented with as source buffer, defined directly in the source code as a C11 u"string literal" which happens to get encoded in little-endian on that machine. Ofc I didn't prefix it with a BOM, but since you brought it up I'll keep it in mind when testing the other way (from utf-8 to utf-16/LE/BE). Thanks again! – Harry K. Jun 04 '21 at 15:01
  • @TedLyngmo, AFAIK that's only for UTF-8. – Harry K. Jun 04 '21 at 15:02
  • @HarryK. UTF-8 never needs a BOM (even though some Windows software adds it). The byte order in UTF-8 is fixed. – Ted Lyngmo Jun 04 '21 at 15:21
  • @TedLyngmo: According to the Unicode standard, BOM is optional when decoding UTF-16. However, the standard requires that UTF-16 text without a BOM be interpreted as **big endian.** See §3.10. – Dietrich Epp Jun 04 '21 at 15:22
  • @TedLyngmo, Yes, I was referring to your comment regarding BOM-less UTF-16 texts, responding that AFAIK BOM-less works only for UTF-8 (not UTF-16) – Harry K. Jun 04 '21 at 15:24
  • @DietrichEpp Brilliant! I think that is info that would make the answer even better. I can't upvote it more than once though :-) HarryK: Aha, ok, got it! – Ted Lyngmo Jun 04 '21 at 15:24
  • Obviously that's crazy, so from a practical perspective, any software which is encoding to UTF-16 would include a BOM. Like iconv. – Dietrich Epp Jun 04 '21 at 15:26
  • @TedLyngmo UTF-16 text without a BOM, _for which the sender did not attach a label that says UTF-16LE or UTF-16BE_, should be interpreted as big-endian. See RFC 2781 § 4.3. This does not hold when the endianness is known. For example, in the Windows API UTF-16 is in fact always UTF-16LE. – Bruno Haible Jun 05 '21 at 13:21
  • @Bruno: Please don’t tag me in a comment if you’re just going to rephrase something I wrote in a different comment, using a different reference that basically says the same thing. – Dietrich Epp Jun 05 '21 at 13:42

Unicode code points can be encoded:

  • UTF-8: 1, 2, 3, or 4 bytes
  • UTF-16: 2 or 4 bytes
  • UTF-32: 4 bytes
  • UCS-2 (obsolete): 2 bytes (but it requires two surrogates for some code points).

So, as a first estimate, if you have the length of the UTF-16 string in bytes, you can be safe by using this formula:

byte_len_utf8 = 4 * byte_len_utf16 / 2

But this is not a tight bound; we can do better: UTF-8 needs 4 bytes only when UTF-16 also needs 4 bytes. So there are two cases: 4 * len / 4 (surrogate pairs) or 3 * len / 2 (everything else).

So where the first formula allocates double the number of bytes (as you supposed), with the second formula the maximum is just 1.5 times the number of bytes. Chinese/Japanese/Korean text falls in exactly that region of code points (3-byte UTF-8, 2-byte UTF-16).
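
In code, the two estimates might look something like this (a sketch; the function names are made up, and byte_len_utf16 is assumed to exclude any terminator):

```c
#include <stddef.h>

/* Coarse bound: assume every 2-byte UTF-16 unit could become 4 UTF-8 bytes. */
size_t utf8_bound_coarse(size_t byte_len_utf16)
{
    return 4 * byte_len_utf16 / 2;      /* 2 × the UTF-16 byte length */
}

/* Tighter bound: 4-byte UTF-8 sequences only come from 4-byte UTF-16
 * surrogate pairs (ratio ×1), so the worst ratio is 3 UTF-8 bytes per
 * 2-byte UTF-16 unit (ratio ×1.5). */
size_t utf8_bound_tight(size_t byte_len_utf16)
{
    return 3 * byte_len_utf16 / 2;
}
```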

Giacomo Catenazzi
  • Thank you! Yes, it makes perfect sense; my doubts were mostly about graphemes and combining characters. – Harry K. Jun 04 '21 at 14:42
  • "*[UCS-2] requires two surrogates for some code points*" - wrong. UCS-2 does not use any surrogates at all, it can only handle codepoints up to U+FFFF. It is UTF-16 that uses surrogates for codepoints > U+FFFF. That is why UTF-16 uses "*2 or 4 bytes*" per codepoint - 2 bytes when no surrogates are used, 4 bytes when surrogates are used. "*UTF-8 is 4 byte length only if UTF-16 is **two byte** length*" - also wrong, it is when UTF-16 is **4 bytes** (2 surrogates), see the chart in [Dietrich's answer](https://stackoverflow.com/a/67838690/65863). – Remy Lebeau Jun 04 '21 at 17:27
  • @RemyLebeau: you should be careful with pedantry. UCS-2 is not just obsolete, it is no longer used. We now have many standards that officially use UCS-2, but later (in extensions, addenda, interpretations, etc.) interpret it as Unicode code units. See JavaScript, JPEG, etc. In 2021, if you read "UCS-2" (outside historical documents), you should really read it as "Unicode code units" - until you convince ISO (and others) to update many standards. -- You are right about the last "typo" (the calculation used 4). – Giacomo Catenazzi Jun 07 '21 at 08:02