Primary Key Theory & I/O Efficiency

Question

According to RSBMS theory, when choosing a primary key, we are supposed to choose amongst minimal superkeys, effectively optimizing our key choice w.r.t # of columns.

Why are we interested in optimizing against # of columns instead of number of bytes? Wouldn't a smaller byte size PK result in smaller index tables and overall more read/write time efficient queries? For example, choosing a PK comprised of 2 varchar(16) rather than 1 varchar(64).

RDBMS theory does not say anything about PKs; traditional practice does. A PK is some CK that you decided to call the PK. RDBMS theory says nothing about implementation. CKs & superkeys are some helpful relational DB things relevant to normalization & other theory & practice. The fact that CKs are "minimal" superkeys has nothing to do with implementation. A certain useful thing just happens to be a certain special case of a certain other useful thing. PS Notice that a SQL PRIMARY KEY declaration declares (the SQL analogue of) a superkey--a smaller UNIQUE can be declared within it. — philipxy, Sep 13 '19 at 20:38
I think philipxy and nvogel are right. Superkeys are where omitting one column would still have uniqueness among the other fields, so your example of two 16 byte fields vs one 64 byte field, is not a superkey. — Marlin Pierce, Sep 14 '19 at 23:29

score 2 · Accepted Answer · edited Sep 13 '19 at 18:23

I think I agree with you.

I don't think theory accounts for physical storage.

Yes, if for instance, you created a column which was a SHA256 of two small columns, say VARCHAR(16), then yes the nodes of the B-tree in the index would take up more space, and the index would not be faster than indexing the two 16 byte columns.

There is some efficiency lost building an index which matches on the first column, and has to switch to comparisons on the second column. The b-nodes are more efficient if the whole b-node is comparing on the same column.

Honestly though, I don't think either amounts to much difference in efficiency. I think the statement is RDBMS theory not accounting for storage size.

score 2 · Answer 2 · answered Sep 14 '19 at 03:28

The identification of minimal rather than non-minimal superkeys is very important when defining keys in a database. If you choose to enforce uniqueness on three columns, A,B,C then that's very different from enforcing uniqueness on just two columns, A,B. A uniqueness constraint on A,B,C would not guarantee the uniqueness of A,B - so A,B would no longer be a superkey. On the other hand if the uniqueness constraint is on A,B then A,B,C is also a superkey. So it's essential from a data integrity point of view to know what the irreducible set of superkeys is.

This has nothing to do with primary keys as such because all keys must be minimal, not just the one you choose to call primary. Storage size and performance are something else. Internal storage is an important consideration in the design of indexes but size and performance are non-functional requirements whereas keys are all about logic and functionality.

Primary Key Theory & I/O Efficiency

2 Answers2