From what I understand, typical binary serialization schemes for non-static structures (like an array or a vector) state the structure's "length" as the first word (usually a 64-bit uint), then encode each element's value with no separators (assuming each element's serialized size is deterministic, so the parser needs no lookahead or backtracking).
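For concreteness, here's how I picture that scheme as a minimal Python sketch (the big-endian byte order and the uint32 element type are just assumptions for illustration):

```python
import struct

def serialize_u32_array(values: list[int]) -> bytes:
    # Length prefix: element count as a big-endian uint64.
    out = struct.pack(">Q", len(values))
    # Fixed-width elements need no separators; the parser knows
    # exactly how many bytes each one occupies.
    for v in values:
        out += struct.pack(">I", v)
    return out
```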
Would the same traditionally apply to UTF-8 strings? I can't see any other way to binary-serialize "unbounded" UTF-8 strings such that the parser needs neither backtracking (which can be really inefficient) nor lookahead (which would require testing against many possibilities, also inefficient). My guess is that the "length" value would denote the number of characters, not the number of bytes, since UTF-8 uses 1 to 4 bytes per character, although the first byte of each character already encodes how many bytes it occupies (eliminating per-character backtracking and lookahead).
As an example, the octet stream for the string `abc` would be `[0,0,0,0,0,0,0,3,97,98,99]`, where `0,0,0,0,0,0,0,3` denotes the uint64 length of the input string `abc`.
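For reference, a minimal Python sketch that produces that exact octet stream (assuming the big-endian uint64 prefix from the example; note that `len` here counts bytes, which happens to coincide with the character count for pure-ASCII input like `abc`):

```python
import struct

def serialize_utf8(s: str) -> bytes:
    data = s.encode("utf-8")
    # Prefix the payload with its length as a big-endian uint64.
    return struct.pack(">Q", len(data)) + data

print(list(serialize_utf8("abc")))
# [0, 0, 0, 0, 0, 0, 0, 3, 97, 98, 99]
```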
Is my intuition correct, or is there something I'm missing?