Does InnoDB stores multibyte strings in expanded form, in indexes?

Question

For example, does each utf8mb4 string take 4 bytes?

I've tried to test this empirically from information_schema.tables.index_length, however, the value is not deterministic, so it's not a reliable method. I also couldn't find this concept in the documentation.

Edit: to clarify, the question is, in a nutshell: how many bytes are required to store a 1-byte utf8mb4 character (say, U+0050) in an InnoDB index on a CHAR(1) NOT NULL column (not taking into account the index metadata)?

score 0 · Accepted Answer · answered May 13 '19 at 15:06

0

All characters in utf8 string are stored as variable-length encodings. Each character uses either 1, 2, 3, or 4 bytes depending on its code point. A string can have a mix of encodings, because each code point identifies its length in the initial bits of each byte.

The characters that are in the ASCII subset will only use 1 byte.

answered May 13 '19 at 15:06

Bill Karwin

538,548
86
673
828

This answer refers to UTF-8 in general (rather than InnoDB utf8mb4) so it's worth pointing out, @Marcus, that UTF-8 doesn't allow a codepoint to be encoded as anything that could be called "expanded form". It allows exactly one byte-sequence representation for a codepoint. – Tom Blodget May 13 '19 at 22:20
1

Yes, I agree. To be clear, what Tom is saying is that each codepoint has one representation, not that the representation uses one byte. :) – Bill Karwin May 13 '19 at 22:24
OK, so, to clarify the question - how many bytes are required to store a 1-byte utf8mb4 character (say, `U+0050`) in an InnoDB index on a `CHAR(1) NOT NULL` column (not taking into account the index metadata)? – Marcus May 15 '19 at 08:13
A one-byte character is stored in one byte. – Bill Karwin May 15 '19 at 15:37

score 0 · Answer 2 · answered May 24 '19 at 01:23

For example, does each utf8mb4 ~~string~~ character take 4 bytes? -- Each character takes up to 4 bytes.

You can use LENGTH(col) to find the number of bytes in a column.
You can use CHAR_LENGTH(col) to find the number of characters in a column.

Those two numbers will be in ratio somewhere between 1:1 and 4:1. English will be 1:1; western Europe: between 1:1 and 2:1; Russian and Greek will be 2:1; most Asian languages will be 3:1; strings with Emoji may get close to 4:1.

You will not find the 6 characters "U+0050" in MySQL tables unless you went through some encoding. Don't do that.

information_schema.tables.index_length indicates how much space is used for all secondary indexes. There are many reasons why this is unrelated to your question -- BTree overhead, block overhead, row overhead, column overhead, pointers to sub-nodes, etc, etc.

Does InnoDB stores multibyte strings in expanded form, in indexes?

2 Answers2