7

So I've seen several mentions of a surrogate key lately, and I'm not really sure what it is and how it differs from a primary key.

I always assumed that ID was my primary key in a table like this:

Users
    ID, Guid
    FirstName, Text
    LastName, Text
    SSN, Int

however, wikipedia defines a surrogate key as "A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data."

According to Wikipedia, it looks like ID is my surrogate key, and my primary key might be SSN+ID? Is this right? Is that a bad table design?

Assuming that table design is sound, would something like this be bad, for a table where the data didn't have anything unique about it?

LogEntry
    ID, Guid
    LogEntryID, Int [sql identity field +1 every time]
    LogType, Int
    Message, Text
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Nate
  • 30,286
  • 23
  • 113
  • 184

7 Answers7

14

No, your ID can be both a surrogate key (which just means it's not "derived from application data", e.g. an artificial key), and it should be your primary key, too.

The primary key is used to uniquely and safely identify any row in your table. It has to be stable, unique, and NOT NULL - an "artificial" ID usually has those properties.

I would normally recommend against using "natural" or real data for primary keys - are not REALLY 150% sure it's NEVER going to change?? The Swiss equivalent of the SSN for instance changes each time a woman marries (or gets divorced) - hardly an ideal candidate. And it's not guaranteed to be unique, either......

To spare yourself all that grief, just use a surrogate (artificial) ID that is system-defined, unique, and never changes and never has any application meaning (other than being your unique ID).

Scott Ambler has a pretty good article here which has a "glossary" of all the various keys and what they mean - you'll find natural, surrogate, primary key and a few more.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
  • So long story short, surrogate key can be the same as a primary key? – Nate Dec 22 '09 at 19:42
  • Yes, it can. Subtle difference. – Oded Dec 22 '09 at 19:43
  • 4
    SURE ! Surrogate is really just whether you use natural data (SSN) or a surrogate (artificial) data for your key. Surrogate can be primary, sure, no problem there. Primary doesn't have to be natural data (I usually argue against natural primary keys, except in very few cases) – marc_s Dec 22 '09 at 19:44
  • OK, thats what I thought, but I'd never heard it called a surrogate key. Thanks! – Nate Dec 22 '09 at 19:45
  • @Nate: The difference is that any key--primary or not--can either be surrogate (there are also many other names for this) or natural. It's not that the "surrogate key can be the same as the primary key", it's more like "the primary key can be a surrogate key or a natural key". – Adam Robinson Dec 22 '09 at 19:46
  • Absolutely not... Without a natural Key you risk corrupting the consistency of your data... See my answer below... If you care about data consistency having only a surrrogate key is a bad bad idea... – Charles Bretana Dec 22 '09 at 20:04
  • 1
    And Primary Key is kinda misleading.... All Keys are just that - Keys. What you use them for is more impoortant than whether they are the primary key or an alternate key. Any key ensures the uniqueness of the rows by the attribytes in the key. And any key can be used as the target of a FK or in a join... – Charles Bretana Dec 22 '09 at 20:06
  • 3
    "The primary key... has to be stable... REALLY 150% sure it's NEVER going to change??" -- stable <> immutable. – onedaywhen Dec 23 '09 at 08:55
  • 1
    No key has to be 100% sure it's never going to change.. Sure, the less likely it is to change the better, but unless you are using it as a foreign key in dependant tables (cause you wisely use an alternate surrogate key for that) then the impact of a change is .... trivial. -- You just ... change it.. – Charles Bretana Dec 24 '09 at 04:43
7

First, a Surrogate key is a key that is artificially generated within the database, as a unique value for each row in a table, and which has no dependency whatsoever on any other attribute in the table.

Now, the phrase Primary Key is a red herring. Whether a key is primary or an alternate doesn't mean anything. What matters is what the key is used for. Keys can serve two functions which are fundementally inconsistent with one another.

  1. They are first and foremost there to ensure the integrity and consistency of your data! Each row in a table represents an instance of whatever entity that table is defined to hold data for. No Surrogate Key, by definition, can ever perform this function. Only a properly designed natural Key can do this. (If all you have is a surrogate key, you can always add another row with every other attributes exactly identical to an existing row, as long as you give it a different surrogate key value)
  2. Secondly they are there to serve as references (pointers) for the foreign Keys in other tables which are children entities of an entity in the table with the Primary Key. A Natural Key, (especially if it is a composite of multiple attributes) is not a good choice for this function because it would mean tha that A) the foreign keys in all the child tables would also have to be composite keys, making them very wide, and thereby decreasing performance of all constraint operations and of SQL Joins. and B) If the value of the key changed in the main table, you would be required to do cascading updates on every table where the value was represented as a FK.

So the answer is simple... Always (wherever you care about data integrity/consistency) use a natural key and, where necessary, use both! When the natural key is a composite, or long, or not stable enough, add an alternate Surrogate key (as auto-incrementing integer for example) for use as targets of FKs in child tables. But at the risk of losing data consistency of your table, DO NOT remove the natural key from the main table.

To make this crystal clear let's make an example. Say you have a table with Bank accounts in it... A natural Key might be the Bank Routing Number and the Account Number at the bank. To avoid using this twin composite key in every transaction record in the transactions table you might decide to put an artificially generated surrogate key on the BankAccount table which is just an integer. But you better keep the natural Key! If you didn't, if you did not also have the composite natural key, you could quite easily end up with two rows in the table as follows

id  BankRoutingNumber BankAccountNumber   BankBalance
 1     12345678932154   9876543210123       $123.12
 2     12345678932154   9876543210123    ($3,291.62)

Now, which one is right?

To marc from comments below, What good does it do you to be able to "identify the row"?? No good at all, it seems to me, because what we need to be able to identify is which bank account the row represents! Identifying the row is only important for internal database technical functions, like joins in queries, or for FK constraint operations, which, if/when they are necessary, should be using a surrogate key anyway, not the natural key.

You are right in that a poor choice of a natural key, or sometimes even the best available choice of a natural key, may not be truly unique, or guaranteed to prevent duplicates. But any choice is better than no choice, as it will at least prevent duplicate rows for the same values in the attributes chosen as the natural key. These issues can be kept to a minimum by the appropriate choice of key attributes, but sometimees they are unavoidable and must be dealt with. But it is still better to do so than to allow incorrect inaccurate or redundant data into the database.

As to "ease of use" If all you are using the natural key for is to constrain the insertion of duplicate rows, and you are using another, surrogate, key as the target for FK constraints, I do not see any ease of use issues of concern.

Charles Bretana
  • 143,358
  • 22
  • 150
  • 216
  • 1
    I disagree - very strongly. A surrogate key e.g. an INT IDENTITY is **PERFECT** to uniquely and clearly identify a row. Any natural key more often than not might *look* like a good candidate, but in the end won't be neither truly always unique, nor easy to use. – marc_s Dec 22 '09 at 21:20
  • Also, yes - the primary key is just one of the possible candidate keys - a set of columns that will uniquely identify a row. Which one of all the possible candidate keys you end up picking for your primary key is somewhat arbitrary - but it makes good sense to pick **one** and stick with that one. – marc_s Dec 22 '09 at 21:21
  • 4
    @marc, keys are NOT there to "uniquely identify a row". They are there to guarantee data integrity and prevent the insertion of multiple rows that represent the SAME entity. What good does it do to be able to uniquely Identify a row if you have no idea what actual entity the row represents. or if the other data in the row is accurate or not ?? and if there are multiple rows for the same entity, with different data attributes, you have no idea which one has the right values in it. – Charles Bretana Dec 23 '09 at 00:48
  • 2
    @marc_s, No, a "primary" key is just another key.... what matters moer is whether it can accurately identify the real world entity the row represents (is it a surrogate or a natural?) and what are you using the key for... It matters nopt at all whether you label it as a "Primary" key, or just another unique Key... – Charles Bretana Dec 23 '09 at 00:50
  • The id column uniquely identifies the row and makes for fast joins. The first step to prevent an account from having two different balances is to make the an unique constraint on the RoutingNo and AccountNo fields, then an error is thrown that will prevent that probelm. For a normalized db, the routing numbers should be in their own table and this table would then have a FK to that table's id PK. – Bill Worthington Sep 30 '22 at 16:52
  • Of what possible value is it Tobe able to uniquely identify a *row* in the database if it doesn't represent a unique real-world entity *outside* the database? What If I have two (or more) *rows* in a database table, (with different ids), that all represent the exact same external entity, with each one having financial transactions attached to it? This is just an opportunity for confusion, prevents critical warnings of data inconsistency and provides nothing except a false sense of security. – Charles Bretana Sep 30 '22 at 20:35
3

Wow, you opened a can of worms with this question. Database purists will tell you never to use surrogate keys (like you have above). On the other hand, surrogate keys can have some tremendous benefits. I use them all the time.

In SQL Server, a surrogate key is typically an auto-increment Identity value that SQL Server generates for you. It has NO relationship to the actual data stored in the table. The opposite of this is a Natural key. An example might be Social Security number. This does have a relationship to the data stored in the table. There are benefits to natural keys, but, IMO, the benefits to using surrogate keys outweigh natural keys.

I noticed in your example, you have a GUID for a primary key. You generally want to stay away from GUIDS as primary keys. The are big, bulky and can often be inserted into your database in a random way, causing major fragmentation.

Randy

Randy Minder
  • 47,200
  • 49
  • 204
  • 358
  • 2
    GUIds as primary key are okay in SQL Server - as long as you don't use them as CLUSTERING KEY - those don't have to be the same (are by default, but you can split that up) – marc_s Dec 22 '09 at 19:48
  • The index fragmentation and all those negative properties come from using a GUID as CLUSTERING KEY - that's different from the PRIMARY KEY – marc_s Dec 22 '09 at 19:49
  • 4
    Database purists have apparently never used real world data. I have used literally hundreds of databases covering hundreds of different types of information and never yet seen a good natural key. Every database I've seen that tries to use one has trouble eventually with the data not being unique or changing and the change needing to filter through hundreds or thousands or even millions of related records. – HLGEM Dec 22 '09 at 20:03
  • "...never yet seen a good natural key" Are you kidding me? You've never seen a many-to-many table? Its key is ( table_1_id, table_2_id ). The whole point of natural keys is that surrogate keys often *allow* duplication of data -- case in point, the email address example I posted below! – user151841 Dec 22 '09 at 20:14
  • @user151841: A many-to-many table that uses two surrogate IDs as its primary key is not a "natural key" since it does not depend on a real-world model. – Petrus Theron Apr 20 '10 at 19:53
3

The reason that database purists get all up in arms about surrogate keys is because, if used improperly, they can allow data duplication, which is one of the evils that good database design is meant to banish.

For instance, suppose that I had a table of email addresses for a mailing list. I would want them to be unique, right? There's no point in having 2, 3, or n entries of the same email address. If I use email_address as my primary key ( which is a natural key -- it exists as data independently of the database structure you've created ), this will guarantee that I will never have a duplicate email address in my mailing list.

However, if I have a field called id as a surrogate key, then I can have any number of duplicate email addresses. This becomes bad if there are then 10 rows of the same email address, all with conflicting subscription information in other columns. Which one is correct, if any? There's no way to tell! After that point, your data integrity is borked. There's no way to fix the data but to go through the records one by one, asking people what subscription information is really correct, etc.

The reason why non-purists want it is because it makes it easy to use standardized code, because you can rely on refering to a single database row with an integer value. If you had a natural key of, say, the set ( client_id, email, category_id ), the programmer is going to hate coding around this instance! It kind of breaks the encapsulation of class-based coding, because it requires the programmer to have deep knowledge of table structure, and a delete method may have different code for each table. Yuck!

So obviously this example is over-simplified, but it illustrates the point.

user151841
  • 17,377
  • 29
  • 109
  • 171
  • 4
    yeah, but in this case, just put a unique constraint on the e-mail field - problem solved. PLUS: I **really** don't want a potentially 150-character field as my primary key!! Imagine having to reference that from a child table....... – marc_s Dec 22 '09 at 21:18
  • 2
    the second idea of having composite primary keys is even worse - if you need to reference that table from 3 or 4 child tables - imagine those queries and those messy statements! Avoid that like the plague - just use a single, surrogate ID and all your problems are solved. – marc_s Dec 22 '09 at 21:27
  • How do you do a primary key in a join table for a many-to-many relation? – user151841 Dec 22 '09 at 21:31
  • either no PK at all, or then an INT IDENTITY - plain and simple – marc_s Dec 22 '09 at 21:31
  • Also, what if your table uses e-mail as PK (so no duplicates there, granted), but also has another field which has to be unique? YOu can't have two PK, nor does picking the e-mail as a natural PK make it any easier to avoid duplicates. – marc_s Dec 22 '09 at 21:32
  • I was just trying to illustrate sides of the argument. Do *you* ever use composite keys? Where? (If not, why do they exist in RDBMSes?) Do you always have a surrogate key? FWIW, I'm not against UNIQUE keys, so I don't see any problem with another column being unique. – user151841 Dec 22 '09 at 21:43
  • 1
    "The reason that database purists get all up in arms about surrogate keys is..." -- that's not the whole story. You could use `UNIQUE (email_address)` and I'd still be "up in arms" if you used the surrogate for FKs because, now that the real key values no longer appear in the referencing tables, I have to use more joins... and wasn't using a surrogate meant to makes things more efficient? – onedaywhen Dec 23 '09 at 09:03
  • 1
    A well designed Delete method will take a dictionary of colName-colValue pairs as a primary key, and therefore even with natural keys, the code is the same for all tables. – ProfK Jan 04 '10 at 06:04
  • @marc_s, regards other unique fields: There is a reason one unique key is called primary, and others are just called unique. This isn't a symptom of a problem, but of the fact that it happens all the time, naturally, – ProfK Jan 04 '10 at 06:05
  • Right, the 'problem' with the "well-designed Delete method" is that the client programmer who calls the method has to construct the dictionary of the key. This required them to know the table structure, instead of just doing delete() or delete(id). It breaks encapsulation, which is why OO enthusiasts don't like it. – user151841 Jan 04 '10 at 15:25
1

Users Table

Using a Guid as a primary key for your Users table is perfect.

LogEntry table

Unless you plan to expose your LogEntry data to an external system or merge it with another database, I would simply use an incrementing int rather than a Guid as the primary key. It's easier to work with and will use slightly less space, which could be significant in a huge log stretching several years.

Petrus Theron
  • 27,855
  • 36
  • 153
  • 287
0
  • The primary key is whatever you make it. Whatever you define as the primary key is the primary key. Usually its an integer ID field.
  • The surrogate key is also this ID field. Its a surrogate for the natural key, which defines uniqueness in terms of your application data.

The idea behind having an integer ID as the primary key (even it doesnt really mean anything) is for indexing purposes. You would then probably define a natural key as a unique constraint on your table. This way you get the best of both worlds. Fast indexing with your ID field and each row still maintains its natural uniqueness.

That said, some people swear by just using a natural key.

Alex
  • 3,099
  • 6
  • 41
  • 56
0

There are actually three kinds of keys to talk about. The primary key is what is used to uniquely identify every row in a table. The surrogate key is an artificial key that is created with that property. A natural key is a primary key which is derived from the actual real life data.

In some cases the natural key may be unwieldy so a surrogate key may be created to be used as a foreign key, etc. For example, in a log or diary the PK might be the date, time, and the full text of the entry (if it is possible to add two entries at the exact same time). Obviously it would be a bad idea to use all of that every time that you wanted to identify a row, so you might make a "log id". It might be a sequential number (the most common) or it might be the date plus a sequential number (like 20091222001) or it might be something else. Some natural keys may work well as a primary key though, such as vehicle VIN numbers, student ID numbers (if they are not reused), or in the case of joining tables the PKs of the two tables being joined.

This is just an overview of table key selection. There's a lot to consider, although in most shops you'll find that they go with, "add an identity column to every table and that's our primary key". You then get all of the problems that go with that.

In your case I think that a LogEntryID for your log items seems reasonable. Is the ID an FK to the Users table? If not then I might question having both the ID and the LogEntryID in the same table as they are redundant. If it is, then I would change the name to user_id or something similar.

Tom H
  • 46,766
  • 14
  • 87
  • 128
  • There are not 3 kinds of Keys. Primary Key vs Alternate Key is just a label to attach to one of the keys. All Keys (Prim/Alternate) are the same. What matters is what it is used for. ANY key (Primary or ALternate), can be used as target of FK constraint or in a Join. Only a natural Key, (whether it is Primary or ALternate) can be used to help ensure data integrity/consistency. – Charles Bretana Dec 22 '09 at 20:36
  • Is a Primary Key a key? Is an Alternate Key a key? Is a Natural Key a key? Are they all exactly the same? It's not rocket science Charles... 1 + 1 + 1 = 3 – Tom H Dec 22 '09 at 23:04
  • @Tom H.: I think what @Charles Bretana is trying to say is that a key is just a key and that 'primary key' is a deprecated concept in relational theory. Here's some light reading: http://consultingblogs.emc.com/davidportas/archive/2006/09/14/Down-with-Primary-Keys_3F00_.aspx – onedaywhen Dec 23 '09 at 09:08
  • I understand, but when the OP specifically asked about primary keys I think that it's appropriate to talk about that term, whether or not Charles thinks it's a relevant term today. It's certainly not worth a down-vote. – Tom H Dec 23 '09 at 13:37