
From what I understand, typical binary serialization schemes for variable-length structures (like an array or a vector) state the structure's "length" as the first word (usually a 64-bit unsigned integer), then encode each element's value with no separators (assuming each element's encoding is self-delimiting, so the binary parser needs no lookahead or backtracking).

Is the same traditionally done for UTF-8 strings? I can't see another way to serialize "unbounded" UTF-8 strings such that the parser needs neither backtracking (which can be really inefficient) nor lookahead (which means testing against various possibilities, also inefficient). My guess is that the "length" value would denote the number of characters, not the number of bytes, since UTF-8 uses 1 to 4 bytes per character, although the first byte of each character already indicates how many bytes it occupies (eliminating backtracking and lookahead per character).

As an example, the octet stream for the string abc would be

[0,0,0,0,0,0,0,3,97,98,99]

where 0,0,0,0,0,0,0,3 denotes the uint64 length of the input string, abc.

Is my intuition correct, or is there something I'm missing?

Athan Clark
  • I do not understand what you are trying to achieve. UTF-8 is a binary encoding, so you should treat UTF-8 as a binary sequence of bytes. If you want semantic values, you should use Unicode code points (that is, decoded UTF-8). I think that mixing semantics with the binary representation will just cause trouble. For efficiency you may look at Python: it looks at the highest code point and then decides whether to store the string as extended ASCII, an array of 16-bit integers (sort of UTF-16), or an array of 32-bit integers (sort of UTF-32). – Giacomo Catenazzi Feb 27 '19 at 19:02
  • @GiacomoCatenazzi I understand that if my binary string is assumed to be _entirely_ UTF-8 encoded text data, then I do not need to supply a range and can just parse through to exhaustion. But in the case that some UTF-8 data is a field of a struct, for instance, I believe there needs to be a range parameter, which I indicate as the first word. I'm just wondering how this is usually achieved in most languages. – Athan Clark Feb 27 '19 at 20:16
  • Most languages don't work with encoded Unicode. Python, JavaScript (and to some extent C on Windows) often use UCS-2 internally. Some languages (like ordinary C) just treat it as a zero-terminated binary string. I would really avoid mixing encoding and semantics. In such cases you should parse the string, check that it is valid, and handle invalid cases [so you nearly have a decoder that discards its results]. Note: byte length is unambiguous, whereas Unicode length has various interpretations: number of code points, or number of "printable characters" (an accented letter has length 1). – Giacomo Catenazzi Feb 28 '19 at 08:18
  • An array of bytes known to contain text encoded with UTF-8 is an invariant concept. It could be treated like any other array of bytes. The concept of string and UTF-8 string are not so universal in programming languages. In short, use byte count if that's what's needed. – Tom Blodget Mar 01 '19 at 01:44

1 Answer


In UTF-8, the Unicode code point U+0000 (NUL) is encoded as a single byte of value zero. A zero byte does not occur in the encoding of any other code point, so a null-terminated byte string can be used without a preceding length as long as embedded NULs are not allowed in the sequence. Otherwise, a preceding length can be used, as you have shown in the question.

For example, the Unicode string "abcdéfg一二三四", followed by a NUL terminator, is encoded as the hexadecimal bytes:

61 62 63 64 c3 a9 66 67 e4 b8 80 e4 ba 8c e4 b8 89 e5 9b 9b 00
a  b  c  d  é     f  g  一       二       三       四        ␀
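
Both framings can be sketched in a few lines of Python. This is only an illustration, not a standard wire format: the function names are hypothetical, the length prefix is assumed to be a big-endian uint64 as in the question, and it is assumed to count bytes rather than characters so that a reader can slice out the payload without decoding it first.

```python
import struct


def encode_nul_terminated(text):
    """UTF-8 bytes followed by a single 0x00 terminator."""
    if "\x00" in text:
        raise ValueError("embedded NUL cannot be represented in this framing")
    return text.encode("utf-8") + b"\x00"


def decode_nul_terminated(buf, offset=0):
    """Return (string, offset just past the terminator)."""
    end = buf.index(b"\x00", offset)  # raises ValueError if no terminator
    return buf[offset:end].decode("utf-8"), end + 1


def encode_length_prefixed(text):
    """Big-endian uint64 byte count (assumed), then the UTF-8 bytes."""
    payload = text.encode("utf-8")
    return struct.pack(">Q", len(payload)) + payload


def decode_length_prefixed(buf, offset=0):
    """Return (string, offset just past the payload)."""
    (length,) = struct.unpack_from(">Q", buf, offset)
    start = offset + 8
    end = start + length
    if end > len(buf):
        raise ValueError("buffer is shorter than the declared length")
    return buf[start:end].decode("utf-8"), end
```

Each decoder returns the offset where the next field starts, which is what makes either framing usable for a string embedded in a larger structure. For "abc" the length-prefixed form is the eleven bytes shown in the question; for text outside ASCII the byte count and the character count differ, which is why the choice has to be fixed by the format.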

UTF-8 doesn't need backtracking or lookahead since the lead byte of a sequence indicates the number of trailing bytes required for the code point:

61 (hex) = 01100001 (bin) (one-byte sequence)
c3 (hex) = 11000011 (bin) (two-byte sequence)
e4 (hex) = 11100100 (bin) (three-byte sequence)

Trailing bytes all match the bit pattern 10xxxxxx:

a9 (hex) = 10101001 (bin) (trailing byte)
b8 (hex) = 10111000 (bin) (trailing byte)
80 (hex) = 10000000 (bin) (trailing byte)
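
The lead-byte rule can be written as a small classifier. The sketch below is illustrative only: `sequence_length` and `split_code_points` are hypothetical names, the four-byte case (lead bits 11110xxx) is added for completeness, and the code does not reject overlong or otherwise invalid encodings the way a real decoder must.

```python
def sequence_length(lead):
    """Number of bytes in the UTF-8 sequence that starts with `lead`."""
    if (lead & 0x80) == 0x00:   # 0xxxxxxx
        return 1
    if (lead & 0xE0) == 0xC0:   # 110xxxxx
        return 2
    if (lead & 0xF0) == 0xE0:   # 1110xxxx
        return 3
    if (lead & 0xF8) == 0xF0:   # 11110xxx
        return 4
    # 10xxxxxx is a trailing byte; 11111xxx never appears in UTF-8.
    raise ValueError("not a valid lead byte: 0x%02x" % lead)


def split_code_points(data):
    """Split UTF-8 bytes into one chunk per code point, reading left to right."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the lead must match 10xxxxxx.
        if len(chunk) != n or any((b & 0xC0) != 0x80 for b in chunk[1:]):
            raise ValueError("truncated or malformed sequence at offset %d" % i)
        chunks.append(chunk)
        i += n
    return chunks
```

The loop only ever inspects the current lead byte to decide how far to advance, so it needs neither lookahead nor backtracking. It still needs to know where the string ends, which is the job of the terminator or the length prefix.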

Mark Tolonen
  • Are null terminators common for arrays and other structures? I don't see any compromise for using a null terminator in the case of structures that can be dynamically allocated, like vectors. – Athan Clark Feb 28 '19 at 17:52
  • @AthanClark It's all up to the implementation. A C structure with a member like `char SerialNumber[32]` can write up to 31 characters plus a null. Another might use a variable length structure like `int len; char data[1];` where the structure is used as a header for a variable memory allocation, and `len` is used to indicate the actual length of `data`. Both are common. – Mark Tolonen Feb 28 '19 at 18:21