1

I am struggling to understand what datatype to use for an MySQL database table.

Let's say we book publishing company and we need to create a database of all our books and authors in MySQL database. We have around 500000 books. A book has an unique ISBN (for example 978-3-16-148410-0).

So we have two options here to store our books:

  1. Create a id VARCHAR(24) NOT NULL natural primary key column and store our ISBNs there, or
  2. Create a surrogate id INT NOT NULL AUTO_INCREMENT but then also a isbn UNIQUE VARCHAR(24) column

What from I understand, the general consensus is not to use VARCHAR(n) as a primary keys as it takes more storage and performance doing lookups and joins, and generally that makes sense for me.

However, if all our operations will be towards ISBNs (SELECT * FROM books WHERE isbn = ?, UPDATE, DELETE, etc) - Why not use the VARCHAR(24) as the primary key?

I am struggling to understand that if you have an immutable natural key (like a book's ISBN) and 95% of all database operations require using that field regardless, shouldn't using a VARCHAR(24) always outperform a surrogate key design?

I feel like having a surrogate AUTO_INCREMENT INT key here, is just totally meaningless. It does not give any benefits.

Or am I missing something fundamental here when it comes to determining primary keys.

nvogel
  • 24,981
  • 1
  • 44
  • 82
corgrath
  • 11,673
  • 15
  • 68
  • 99
  • 1
    Avoid natural keys. They are never unique. – The Impaler Apr 23 '22 at 15:33
  • In this case, a book's ISBN is always unique. It will absolutely never change. – corgrath Apr 23 '22 at 15:33
  • Unfortunately, an ISBN it's not unique for your app. There are always issues with bar code readers, or typos when they are entered by hand, or printing mistakes, you name it. Real world apps need to account for all these cases, and there will be duplicates of ISBN, SSNs, VIN (cars), engine numbers, etc. If you are inserting a new row that is duplicate... which one is the correct one? Maybe the old one is bad. – The Impaler Apr 23 '22 at 15:35
  • I this specific question, ISBNs are immutable and absolute. You are focusing on the wrong thing in the question. I am not asking if ISBN is a reliable ID for books, but if an immutable and unique VARCHARs is worse than surrogate INT keys as primary keys. – corgrath Apr 23 '22 at 15:40
  • In terms of performance, yes a natural key is marginally slower than a synthetic key (such as `INT` or `BIGINT`). But again, it doesn't matter too much, since you should not use natural keys in the first place. They are only good for professors while teaching, not for the real world. – The Impaler Apr 23 '22 at 15:42

2 Answers2

5

I would use the ISBN as the primary key.

Primary key lookups in MySQL's default storage engine InnoDB are more efficient than lookups by secondary index.

It's true an integer takes less storage space than a 24-character varchar, but in your case, I assume you have to store the ISBN anyway. If you could use an integer instead of the ISBN, that would save storage.

The comment above that natural keys tend to violate uniqueness is a good warning in general. The violations usually come from the marketing department. ;-)

But for a given dataset, you can be sure that the natural key is free of duplicates. If you do experience an error reading the ISBN in your library collection, the librarian will have to resolve that manually. But I don't expect that to happen very often for 500,000 books.

Tip: Define the varchar with a binary collation, and it'll be a bit faster to do string comparisons. For example:

CREATE TABLE Books (
 isbn varchar(24) COLLATE utf8mb4_bin,
 -- ...other columns...
 PRIMARY KEY (isbn)
) DEFAULT CHARSET=utf8mb4;
Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • 4
    "...The violations usually come from the marketing department...." -- so true. – The Impaler Apr 23 '22 at 15:48
  • 1
    Thank you very much for this very good explanation! – Jonas Metzler Apr 23 '22 at 15:51
  • 1
    Saving storage, if using integers, is also not a valid argument everytime. Most databases would create a unique index to maintain the primary key, and that index would have been useful for lookups. I can imagine that I your case there would be a lot of ISBN lookups, which would then require creating an additional index; this takes not only extra space (additional ISBN plus the pointer to data row), but search results would need additional trip to disk for reading the data row. – tinazmu Apr 24 '22 at 02:07
0

Surrogate performance (General discussion)

  • "almost always"? No.

  • AUTO_INCREMENT that is not even used for Joining -- why bother having it if you have a "natural PK"; it takes space without providing any benefit.

  • UUID/GUID -- Frequently worse. This is due to lack of "locality of reference".

  • Many-to-many mapping table -- Always worse. Best: The natural PK is the pair of ids. A secondary key is the pair, but in opposite order.

  • Space: Since the PK is silently included in every secondary index, the bigger the PK, the bulkier the secondary index is. If you have exactly one secondary index, the size is a tossup. With more than two secondaries, space is being chewed up.

  • Range scan on the PK. If you need to use BETWEEN (etc) it is often beneficial to have that range key be the PK. If, instead, the range scan is going through a secondary index (very efficiently), but then having to reach into the data's BTree, this is a lot of extra work.

  • "Index merge". Some DB vendors perform AND or OR by gathering the "row identifiers", then and'ing or or'ing the lists. Then looking up the actual rows. The way InnoDB is structured, this is almost never worth doing.

  • Locality of reference. An auto-inc value lays out the data in roughly chronological order. This may be a benefit. Or it may not. Take sensor data or stock quotes. The main queries look up multiple rows based on sensor_id or ticker or author or user, not datetime, yet the data arrives in time-order. It is actually better (overall) to have PRIMARY KEY(ticker, datetime). (Or if there can be dups, then PRIMARY KEY(sensor_id, datetime, id), INDEX(id).) I saw one system double in throughput when they rearranged the PK of there biggest table in that way.

  • I have created hundreds (thousands?) of tables. Scanning through their PKs, I find that only 1/3 use an AUTO_INCREMENT surrogate PK.

Bottom line: Since I ponder your question every time I type CREATE TABLE, I would say that a "natural" key is better 2/3 of the time.

Looking at ISBN

Off to your specific, ISBN. Look at your tables. You have one that 'wants' to have ISBN as its PK, correct? How often do you JOIN to that table? How many secondary indexes are there on that table? You probably never do a range query like WHERE ISBN > "...", correct?

If you have id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, you will probably also have UNIQUE(isbn). But not vice versa.

ISBN should probably be VARCHAR(24) CHARACTER SET ascii COLLATE ascii_bin. This helps (a little) with both speed and space.

Rick James
  • 135,179
  • 13
  • 127
  • 222