Why does InnoDB require clustered index upon creating a table?

Question

Even if I don't have a primary key or unique key, InnoDB still creates a cluster index on a synthetic column as described below.

https://dev.mysql.com/doc/refman/5.5/en/innodb-index-types.html

So, why does InnoDB have to require clustered index? Is there a defenite reason clustered index must exist here?

In Oracle Database or MSSQL I don't see they require this. Also, I don't think cluster index have so tremendous advantage comparing to ordinary table either.

It is true that looking for data using clustering key does not need an additional disk read and faster than when I don't have one but without cluster index, secondary index can look up faster by using physical rowID. Therefore, I don't see any reason for insisting using it.

I'm not really sure what you expect us to say beyond that is described in the documentation you linked. Innodb creators made a design decision. If you do not like it, use a different table type or rdbms product. — Shadow, Feb 09 '18 at 06:25
@Shadow I just want to know the reason why they made that decision. My purpose is to understand more about clustered index, not to select a product to use. — Bonsaisteak, Feb 09 '18 at 08:17
There are probably plenty of optimisation decisions that can be made when there are fewer choices. Identifying which key points relate back to the first conversation that started "If we assume all tables have a clustered index we can do ..." is unlikely. — Damien_The_Unbeliever, Feb 09 '18 at 08:43
@PhanHoangMinh then you should ask the developers. We can only guess beyond what's written in the documentation. — Shadow, Feb 09 '18 at 10:15

score 1 · Accepted Answer · answered Feb 16 '18 at 22:29

Other vendors have a "ROWNUM" or something like that. InnoDB is much simpler. Instead of having that animal, it simply requires something that you will usually want anyway. In both cases, it is a value that uniquely identifies a row. This is needed for guts of transactions -- knowing which row(s) to lock, etc, to provide transactional integrity. (I won't go into the rationale here.)

In requiring (or providing) a PK, and in doing certain other simplifications, InnoDB sacrifices several little-used (or easily worked around) features: Multiple pks, multiple clustered indexes, no pk, etc.

Since the "synthetic column" takes 6 bytes, it is almost always better to simply provide id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, even if you don't use it. But if you don't use it, but do have a non-NULL UNIQUE key, then you may as well make it the PK. (As MySQL does by default.)

A lookup by a secondary key first gets the PK value from the secondary key's BTree. Then the main BTree (with the data ordered by the PK) is drilled down to find the row. Hence, secondary keys can be slower that use of the PK. (Usually this is not enough slower to matter.) So, this points out one design decision that required a PK.) (Other vendors use ROWNUM, or something, to locate the record, instead of the PK.)

Back to "Why?". There are many decisions in MySQL where the designers said "simplicity is better for this free product, let's not bother building some complex, but little-used feature. At first there were no subqueries (temp tables were a workaround). No Views (they are only syntactic sugar). No Materialized Views (OK, this may be a failing; but they can be simulated). No bit-mapped or hash or isam (etc) indexing (BTree is very good for "all-around" usage).

Also, by always "clustering" the PK with the data, lookups via the PK are inherently faster than the competition (no going through a ROWNUM). (Secondary key lookups may not be faster.)

Another difference -- MySQL was very late in implementing "index merge", wherein it uses two indexes, then ANDs or ORs the results. This can be efficient with ROWNUMs, but not with clustered PKs.

(I'm not a MySQL/MariaDB/Percona developer, but I have used them since 1999, and have been to virtually all major MySQL Conferences, where inside info is often divulged. So, I think I have enough insight into their thinking to present this answer.)

Thank you very much for your answer. So the reason is to always enable faster lookup using clustering primary key than competitors (assuming user usually want to use a PK). Thank you also for the meaning of "cluster", I wondered about it too. Also, I have found one more reason : because Clustered index can be stored into InnoDB memory, and the data can be retrieved logically, secondary index lookups does not require additional hard disk read. Considering cases when data fragmentation happen, secondary index lookup in InnoDB is not slower than competitors. — Bonsaisteak, Mar 07 '18 at 09:21
"Cluster" has many meanings. In my Answer, I focused on its meaning for the PK of InnoDB. (Another meaning involves multiple servers working together.) — Rick James, Mar 07 '18 at 15:13
"does not require additional hard disk read" -- Keep in mind that all disk reads are _potentially_ cached. That is there may _or may not_ be the actual delay involved of such. The number of disk blocks _touched_ (whether on disk or cached) is an important metric. With that, one should note which blocks are likely to be cached. (Etc -- this is a long discussion on its own.) — Rick James, Mar 07 '18 at 15:16

Why does InnoDB require clustered index upon creating a table?

1 Answers1