
My particular problem is a very wide table (sometimes > 1000 columns), which requires me to split the table. In order to do so smartly, I'd like to split right before I hit Row size too large (> 8126) or The maximum row size for the used table type, not counting BLOBs, is 65535.

For that purpose, I'm trying to find the actual size of each column. Some are easy (BOOLEAN, INT, DATE, TEXT, etc.); VARCHAR is the one I'm trying to figure out, hence the experiment of creating tables with many VARCHAR columns.
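
At least for the declared sizes, I can ask information_schema (assuming the table, here called test, already exists in the current schema); CHARACTER_OCTET_LENGTH reports the maximum byte width under the column's character set:

SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, CHARACTER_OCTET_LENGTH
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'test'
ORDER BY ORDINAL_POSITION;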

However, I cannot really make sense of it. Experimenting with a table containing only VARCHAR(217) columns, I was able to create 100 of them. The maximum size for the next column (the 101st) is VARCHAR(78).

CREATE TABLE IF NOT EXISTS test
(
col1    VARCHAR(217) NOT NULL,
col2    VARCHAR(217) NOT NULL,
...
col100  VARCHAR(217) NOT NULL,
col101  VARCHAR(78) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

I worked out the reasoning as below (a quick way to check the arithmetic is sketched after the list):

  1. 2 bytes per column for storing the length value (the columns are NOT NULL, so no NULL flag is needed)
  2. 217x3 = 651 bytes of (potentially off-page) data per column, since utf8 counts up to 3 bytes per character
  3. So 100 columns = 2x100 + 651x100 = 65300 bytes
  4. Max remaining is 65535 - 65300 = 235 bytes
  5. Therefore, we can fit 1 more column: the formula gives (235 - 2) / 3 = 77.67, and VARCHAR(78) fits exactly because a column whose maximum byte length stays under 256 needs only 1 length byte, so it costs 1 + 78x3 = 235 bytes
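
Nothing schema-specific is needed to check that arithmetic; a plain SELECT reproduces it (the 1-byte length prefix in the last expression is my assumption for columns whose maximum byte length stays under 256):

SELECT
    100*2 + 100*217*3                             AS bytes_used_by_100_cols,  -- steps 1-3: 65300
    65535 - (100*2 + 100*217*3)                   AS bytes_left,              -- step 4: 235
    FLOOR((65535 - (100*2 + 100*217*3) - 1) / 3)  AS last_col_max_chars;      -- step 5: 78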

Applying the above to VARCHAR(128), I expect 65535 / (128x3+2), or 169 columns, with the last column able to fit VARCHAR(99): 169 columns use 169 x (128x3+2) = 65234 bytes, leaving 301, which fits 2 + 99x3 = 299. And it is indeed true. This works:

CREATE TABLE IF NOT EXISTS test
(
col1    VARCHAR(128) NOT NULL,
col2    VARCHAR(128) NOT NULL,
...
col169  VARCHAR(128) NOT NULL,
col170  VARCHAR(99) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

Applying the same logic to VARCHAR(64) doesn't work, however. I would expect 65535 / (64x3+2), or 337 columns, yet I can only get 197 columns. Anything more than that and I get the Row size too large (> 8126) error.

CREATE TABLE IF NOT EXISTS test
(
col1    VARCHAR(64) NOT NULL,
col2    VARCHAR(64) NOT NULL,
...
col196  VARCHAR(64) NOT NULL,
col197  VARCHAR(64) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

The same goes for VARCHAR(14): I can also create 197 columns at most before hitting the Row size too large (> 8126) error. I experimented further and discovered that up to VARCHAR(109), I get the Row size too large (> 8126) error, but from VARCHAR(110) on I get a different error: The maximum row size for the used table type, not counting BLOBs, is 65535.
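
My hedged reading of that boundary, assuming the failing attempt is the 198th column: at VARCHAR(109) the declared row size still passes the 65535 check, so the 8126 check is the one that fails, while at VARCHAR(110) the 65535 check fails first:

SELECT
    198 * (109*3 + 2) AS declared_bytes_at_109,   -- 65142, under 65535, so the 8126 error appears
    198 * (110*3 + 2) AS declared_bytes_at_110;   -- 65736, over 65535, so that error appears instead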

MySQL documentation says

When a table is created with ROW_FORMAT=DYNAMIC, InnoDB can store long variable-length column values (for VARCHAR, VARBINARY, and BLOB and TEXT types) fully off-page, with the clustered index record containing only a 20-byte pointer to the overflow page.

which I suspect is what's going on here: InnoDB must be storing some values fully off-page, because there is no way 197 strings of 64 utf8 characters (up to 192 bytes each) can fit in 8126 bytes. If all of the 64-character utf8 strings were stored in-line, I would expect 41 columns at most (8126 / (64x3+2)).
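
For reference, the two naive bounds from that reasoning, computed for VARCHAR(64) utf8 columns:

SELECT
    FLOOR(8126 / (64*3 + 2))   AS max_cols_if_all_inline,      -- 41
    FLOOR(65535 / (64*3 + 2))  AS max_cols_by_65535_limit;     -- 337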

Any help appreciated. I'm using MySQL 5.7.26-29, Barracuda file format and 16K page size.

maresa
    If you have tables with 1000 columns, I would suggest you start with a Basic Relational Database Design Course – RiggsFolly Jul 07 '21 at 14:06
  • Unfortunately, it's not really feasible to normalize given this particular use-case. That's why it had to be done in a flat manner, which forces me to figure this out. – maresa Jul 07 '21 at 18:06
  • The formulas are messy in a different way. – Rick James Jul 13 '21 at 00:17

2 Answers


Even when InnoDB stores a varchar/blob/text column "fully" off-page, there is still a 20-byte pointer on the main page. So it's not like you can make a table with an infinite number of columns. Or even 1000 columns.
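
To put a rough number on it, ignoring length bytes and all other record overhead, the pointers alone run into the roughly 8KB on-page limit after a few hundred columns:

SELECT FLOOR(8126 / 20) AS rough_column_cap_from_pointers_alone;  -- about 406, before any other overhead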

Also there's a limit to the row size imposed by the metadata file (.frm). It's pretty complicated. See https://www.percona.com/blog/2013/04/08/understanding-the-maximum-number-of-columns-in-a-mysql-table/

And this is completely changed in MySQL 8.0, because the .frm file has been replaced by a new data dictionary implementation. I haven't looked into this enough to know the limits.

All that said, you haven't described why you have a table with so many varchar columns, but I think any table that has hundreds of columns is a Code Smell. That is, it's not guaranteed to be a bad design, but it sure smells like one. It's like when your IDE complains at you when you write a single code method that is 1000+ lines long. Is it possible that a method might need to be that long? Sure. Is it usually a bad idea? Definitely.

Another software engineering metaphor you should be aware of is an XY Problem. Don't let yourself get so fixated on making one solution work that you overlook alternative solutions that might be much easier.

Bill Karwin
  • Thank you for your reply, Bill. The use-case is unusual enough that it's better to store things flat. Normalizing wouldn't make sense in this particular scenario; hence the issue. – maresa Jul 07 '21 at 20:30
  • Perhaps then a relational database is not what you need? Sounds like you need a _file_. – Bill Karwin Jul 07 '21 at 20:49

The max is just over 8KB on-record.

What ROW_FORMAT is being used?
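
(One way to check, assuming the table is named test in the current schema, is information_schema:)

SELECT TABLE_NAME, ROW_FORMAT, CREATE_OPTIONS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'test';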

Long strings may be split between the 8KB limit and a virtually unlimited "off-record" storage.

VARCHARs, TEXTs, VARBINARYs, and BLOBs are allocated in one of these ways:

  • Entirely on-record. This applies to strings < 40 bytes. (I think this applies to all row_formats.)
  • Up to 767 bytes in the record; the rest in an "overflow block". This applies to row_format=COMPACT.
  • Entirely in an overflow block. This applies for DYNAMIC and COMPRESSED. It leaves behind a 20-byte "pointer". So count it as 20 bytes. (A sketch of setting the row format explicitly follows this list.)
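
A minimal sketch of pinning that down explicitly (the table name test_dynamic is made up; it is the ROW_FORMAT clause that matters), so the behavior does not depend on the server's innodb_default_row_format:

CREATE TABLE IF NOT EXISTS test_dynamic
(
    col1 VARCHAR(64) NOT NULL,   -- same kind of column definitions as before, just with the row format stated
    col2 VARCHAR(64) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin ROW_FORMAT=DYNAMIC;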

I don't know:

  • Whether I am off by the "2" that you mentioned.
  • What happens in the cases not mentioned.

As for doing the 'vertical partitioning', I have these thoughts (a small SQL sketch follows the list):

  • Do not have an "array" of things spread across columns. Instead spread them across rows in another table.
  • Split your wide table into more than just 2 tables.
  • Split it based on related information.
  • Split out columns that are mostly NULL. Then use LEFT JOIN when tying them back together since there could be missing rows in the new table.
  • Normalize repeated strings, e.g. a 2-letter country_code instead of the full country name. (But don't bother normalizing strings that are already short.)
  • Where practical, do use smaller column types for non-string columns (e.g. 1-byte TINYINT versus 8-byte BIGINT).
  • Consider 'compressing' large text columns in the client and storing them in a BLOB. Typical text shrinks 3:1. This shrinkage helps with disk space, your row size limit, and perhaps performance.
  • ROW_FORMAT=COMPRESSED shrinks about 2:1, but has a lot of overhead and may not actually help with your question.
  • The CHARACTER SET probably has no impact (not even the 3x you mention) if all the text is using only English characters.
  • Would it work to have a few columns that are used for searching/sorting, plus one large JSON column? That JSON column could be gigabytes in size, landing mostly in the off-record storage. That would quickly solve the 8KB limit.
  • I have seen no one actually change the blocksize. But, in theory, a 32KB block (for all tables on your server) would give you a 16KB max per record. There is also a 64KB option, but it also limits the row size to 16KB.
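
A minimal sketch of that kind of split (all table and column names here are made up): the rarely-populated columns move to a second table that shares the primary key, and a LEFT JOIN reassembles the row when needed:

CREATE TABLE main_part (
    id           INT UNSIGNED NOT NULL AUTO_INCREMENT,
    country_code CHAR(2) NOT NULL,      -- normalized 2-letter code instead of repeating the country name
    title        VARCHAR(100) NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE sparse_part (
    id         INT UNSIGNED NOT NULL,   -- same id value as main_part
    long_note  TEXT NULL,
    packed_doc BLOB NULL,               -- e.g. text compressed in the client before storing
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- LEFT JOIN, since a given row may have nothing in sparse_part.
SELECT m.id, m.title, s.long_note
FROM main_part AS m
LEFT JOIN sparse_part AS s ON s.id = m.id;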

More on sizes:

  • I think a NULL column takes 3 bytes.
  • A row has about 20-30 bytes of overhead.
  • A block is not filled more than 15/16th full (in hopes of allowing one extra row to be added). (I have quibbles with the logic.)
Rick James