0

I've looked for a while for a satisfying answer that is specific to my particular problem, but to no avail. Whether I'm just not looking in the right places, I don't know, but here goes:

I'm pulling data from an application; the data is then manipulated and sent to my own server. Among the data pulled is an identifier that was auto-incremented in the application's database. An example of this identifier that I just retrieved is 955534861. Isn't it a better and more efficient design to skip auto-incrementing my own primary key and just use this value, which I know is and will always stay unique, or should I look into concepts such as surrogate keys?

Thanks in advance.

Thevet
  • 154
  • 9
  • 3
    While there are passionate advocates of both camps, I think the short answer is "it's up to you". But, if you are going to retain the surrogate key, then it's good practice to formally declare the UNIQUE KEY on the natural index. – Strawberry Jul 03 '13 at 16:03

4 Answers

2

The situation you describe resembles my primary job which is maintaining a data warehouse. We get data from other systems and store it.

Something that happens to us is that these "other systems" change. That creates the possibility that the new version of the "other system" will reuse a unique identifier from the previous system. We deal with this by adding something to that record in our data warehouse to guarantee its uniqueness. It might be a field to identify the source system or it might be a date. It is never an autogenerated number.

If there is any chance of this happening to you, you might want to expand your options.
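
A minimal sketch of that idea, using Python's built-in `sqlite3` as a stand-in for the real warehouse. The table name `imported_record`, the column names, and the source labels `app_v1` and `app_v2` are hypothetical; only the identifier 955534861 comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE imported_record (
        source_system TEXT    NOT NULL,  -- hypothetical label for the originating system
        external_id   INTEGER NOT NULL,  -- identifier as issued by that system
        payload       TEXT,
        PRIMARY KEY (source_system, external_id)
    )
    """
)

# The same external_id arriving from two versions of the source system
# no longer collides, because the source label is part of the key.
insert = "INSERT INTO imported_record VALUES (?, ?, ?)"
conn.execute(insert, ("app_v1", 955534861, "old record"))
conn.execute(insert, ("app_v2", 955534861, "new record"))

# A true duplicate (same source, same id) is still rejected.
try:
    conn.execute(insert, ("app_v2", 955534861, "duplicate"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM imported_record").fetchone()[0]
```

Here the key is the pair (source system, external id), so uniqueness is judged per source rather than globally.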

Dan Bracuk
  • 20,699
  • 4
  • 26
  • 43
  • Dan is right of course. In the real world, source systems change sometimes, and with that, the possibility for key collisions arises. In OLAP particularly, best practice is to always use a surrogate key. – Curt Jul 03 '13 at 16:34
1

If there is a natural key in your model, you cannot replace it by creating a surrogate key.

You can only add a surrogate key and keep the existing natural key, which has its pros and cons, as described here.

Community
  • 1
  • 1
Branko Dimitrijevic
  • 50,809
  • 10
  • 93
  • 167
0

The primary key, typically an auto-incrementing ID, is what MySQL uses as a row identifier as well, so it should be left alone. If you need a secondary key that's generated by your application for some other purpose, you may want to add that as another column with a UNIQUE index on it.
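
As a rough illustration of that layout, here is a sketch using Python's built-in `sqlite3` rather than MySQL (in MySQL the column attribute is spelled `AUTO_INCREMENT`). The table name `item` and its column names are made up for the example; the identifier value is the one from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE item (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key owned by this database
        external_id INTEGER NOT NULL UNIQUE,            -- identifier pulled from the application
        payload     TEXT
    )
    """
)

insert = "INSERT INTO item (external_id, payload) VALUES (?, ?)"
conn.execute(insert, (955534861, "first import"))

# The UNIQUE index stops the same external identifier from being stored twice.
try:
    conn.execute(insert, (955534861, "second import"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

surrogate_id, external_id = conn.execute(
    "SELECT id, external_id FROM item"
).fetchone()
```

The database hands out `id` on its own, while the application-supplied value stays queryable and protected against duplicates.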

In other databases where there's a proper row identifier mechanism, this is less of an issue.

tadman
  • 208,517
  • 23
  • 234
  • 262
0

This'll get a little nerdy, but bear with me:

As long as a key value is unique, it'll serve its function. But for performance, you ideally want that key value to be as short as possible.

GUIDs are commonly used, because they are statistically highly unlikely to ever be repeated. But that comes at the expense of size: they are 128 bits long, which makes them longer than a machine word. Comparing two GUIDs (as must be done repeatedly when sorting, or when walking down a B-tree index) takes multiple processor instructions to load and compare the values. They also consume more memory when cached.
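
To put numbers on the size difference, here is a small Python sketch. It assumes the integer key is stored as a signed 32-bit value, which the example identifier from the question (955534861) fits into; the variable names are only illustrative.

```python
import struct
import uuid

guid = uuid.uuid4()
guid_bytes = guid.bytes                   # raw binary form of the GUID

int_key = 955534861                       # example identifier from the question
int_bytes = struct.pack("<i", int_key)    # packed as a signed 32-bit integer

guid_bits = len(guid_bytes) * 8           # size of the binary GUID in bits
int_bits = len(int_bytes) * 8             # size of the integer key in bits
text_guid_len = len(str(guid))            # length if stored as hyphenated text
```

The binary GUID is four times the size of the 32-bit integer, and storing the GUID as text (36 characters) widens the gap further.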

The advantages of auto-incrementing key values are:

  • They are guaranteed to be unique. Proxy index values are only predicted to be unique.
  • Because they will have full value coverage over the range of their underlying datatype, the most compact possible type may be used. This makes for smaller indexes and more efficient compare operations.
  • Because the smallest possible type can be used, more index values can be stored on a single database page, which means you're more likely to get a cache hit when searching or joining on that value. That means that performance will be, all other things being equal, somewhat better.
  • On most databases, auto-incrementing keys are worked into the database engine, so there is very small overhead in generating them.
  • If you employ a clustered index on your key value, new record inserts are less likely to require a random disk seek, and neighbouring records are more likely to be picked up during read-ahead, so if you do any kind of sequential processing or lookup based on that key, it'll probably run faster.
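
The page-density point in the list above can be estimated with some back-of-the-envelope arithmetic. This is a deliberately simplified sketch: it assumes a 16 KiB page (InnoDB's default) and ignores page headers, per-record overhead, and row pointers, so real figures will be lower. The helper `keys_per_page` is hypothetical.

```python
PAGE_SIZE = 16 * 1024  # assumption: 16 KiB page, overhead ignored


def keys_per_page(key_size_bytes, page_size=PAGE_SIZE):
    """Upper bound on how many keys of the given size fit in one page."""
    return page_size // key_size_bytes


int_keys = keys_per_page(4)      # 32-bit integer key
bigint_keys = keys_per_page(8)   # 64-bit integer key
guid_keys = keys_per_page(16)    # 128-bit GUID stored in binary form
```

Under these assumptions a page holds four times as many 32-bit keys as binary GUIDs, which is the effect behind the cache-hit argument.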
Curt
  • 5,518
  • 1
  • 21
  • 35