1

Does InnoDB stores multibyte strings in expanded form, in indexes?

For example, does each utf8mb4 string take 4 bytes?

I've tried to test this empirically from information_schema.tables.index_length, however, the value is not deterministic, so it's not a reliable method. I also couldn't find this concept in the documentation.

Edit: to clarify, the question is, in a nutshell: how many bytes are required to store a 1-byte utf8mb4 character (say, U+0050) in an InnoDB index on a CHAR(1) NOT NULL column (not taking into account the index metadata)?

Marcus
  • 5,104
  • 2
  • 28
  • 24

2 Answers2

0

All characters in utf8 string are stored as variable-length encodings. Each character uses either 1, 2, 3, or 4 bytes depending on its code point. A string can have a mix of encodings, because each code point identifies its length in the initial bits of each byte.

enter image description here

The characters that are in the ASCII subset will only use 1 byte.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • This answer refers to UTF-8 in general (rather than InnoDB utf8mb4) so it's worth pointing out, @Marcus, that UTF-8 doesn't allow a codepoint to be encoded as anything that could be called "expanded form". It allows exactly one byte-sequence representation for a codepoint. – Tom Blodget May 13 '19 at 22:20
  • 1
    Yes, I agree. To be clear, what Tom is saying is that each codepoint has one representation, not that the representation uses one byte. :) – Bill Karwin May 13 '19 at 22:24
  • OK, so, to clarify the question - how many bytes are required to store a 1-byte utf8mb4 character (say, `U+0050`) in an InnoDB index on a `CHAR(1) NOT NULL` column (not taking into account the index metadata)? – Marcus May 15 '19 at 08:13
  • A one-byte character is stored in one byte. – Bill Karwin May 15 '19 at 15:37
0

For example, does each utf8mb4 string character take 4 bytes? -- Each character takes up to 4 bytes.

You can use LENGTH(col) to find the number of bytes in a column.
You can use CHAR_LENGTH(col) to find the number of characters in a column.

Those two numbers will be in ratio somewhere between 1:1 and 4:1. English will be 1:1; western Europe: between 1:1 and 2:1; Russian and Greek will be 2:1; most Asian languages will be 3:1; strings with Emoji may get close to 4:1.

You will not find the 6 characters "U+0050" in MySQL tables unless you went through some encoding. Don't do that.

information_schema.tables.index_length indicates how much space is used for all secondary indexes. There are many reasons why this is unrelated to your question -- BTree overhead, block overhead, row overhead, column overhead, pointers to sub-nodes, etc, etc.

Rick James
  • 135,179
  • 13
  • 127
  • 222