3

I am trying to understand if it's more performant for a database to manage id/sequence generation for use as a PK or to have an application simply generate it's own ID as, say a random 64-bit number.

My suspicion is that application + random generation will be more performant because there is less concurrency to manage, but then again, if the DB generates a sequence then it always knows that what comes next is ordered and can optimize index maintenance.

Benjamin Hodgson
  • 42,952
  • 15
  • 108
  • 157
Christian Bongiorno
  • 5,150
  • 3
  • 38
  • 76
  • This is a great question. I was always taught that fragmented clustered indexes, as would happen with randomised IDs, are bad for performance. I'm keen to see what the experts have to say! – Benjamin Hodgson Apr 01 '16 at 23:47
  • Why would that happen in which database engine please? The index - according to its name - is an index, so a dictionary, a map, or name as you wish: it maps the PK values into row ids. It can not have any semantic assumption on what values are you using as PK. You can have e.g. strings or dates as PK. – Gee Bee Apr 02 '16 at 00:36
  • You should use guids for your primary keys, not numbers. – Software Engineer Apr 02 '16 at 00:40
  • Guids are strings which are inherently slower to index than a number that fits into a single word on a machine architecture. A 36 character uuid will either have to be hashed or sorted for indexing. Both are more computationally heavy than operating on a native word – Christian Bongiorno Apr 02 '16 at 03:31
  • There are 2 different types of indexes: comparative, like btree, and hashing. Each method has tradeoffs that can be effected by the underlying data type. – Christian Bongiorno Apr 02 '16 at 03:39
  • @ChristianBongiorno GUIDs are not strings. They are GUIDs. Just because it looks like a string when it's displayed doesn't mean it's represented in the same way as a string. (The same goes for numbers, lists, objects, anything.) In practice the database will store them in [the standard "16-byte number" representation](https://tools.ietf.org/html/rfc4122#section-4.1). – Benjamin Hodgson Apr 02 '16 at 09:46
  • Fair enough - representation does not necessarily = display. Then a GUID is just a 16byte random number. – Christian Bongiorno Apr 02 '16 at 13:33

1 Answers1

0

The idea behind PK is that they are unique by design. Using a 64 bit random number is not unique by design - so you have to prepare your code for PK collision, then randomizing a new id, and submit the new row again to the database. It can be quite a hassle. Having writers inserting new data in concurrency will eventually make these problem even worse.

The database can generate unique ids nicely - after all, the PK problem is pretty standard stuff, and there are good standard solutions for that.

The ID generator itself is a bottleneck - since it has to ensure uniqueness even if it is called from multiple sessions in parallel. For example, Oracle has sequence generators, which pre-generate and "cache" upcoming numbers per session, this way addressing the concurrency issue.

One important thing we can learn from Mr.Tom Kyte: measure your use case using realistic data in the tables and realistic load. This is the only source of true wisdom :)

My two cents: database engines differ a lot! Sharing what database you're using helps getting better answers.

Gee Bee
  • 1,794
  • 15
  • 17
  • 64-bit random numbers are not *technically* unique, but the chance of collision is so low that it's *effectively* unique (locally within your domain). UUIDs are not technically unique either but nobody is concerned about collisions. – univerio Apr 02 '16 at 00:44
  • Univerio is right. This is exactly how all hashing works. Consider how a 2gb file can produce an md5 hash as an identity. Though in theory a collision is possible, it will realistically never occur – Christian Bongiorno Apr 02 '16 at 03:34
  • Also, this is a DB agnostic question – Christian Bongiorno Apr 02 '16 at 03:35
  • I guess that *hashing* and *identifying* are different concerns. Hashing works well in one direction. When hashing, a low possibility of collisions is good enough. On the other hand, identification must work in two directions. When identifying, a collision is not allowed. For example, no two humans shall have exactly the *same* SSN, just because they have the same name, and born on the same day, which would cause the same hash. – Gee Bee Apr 04 '16 at 12:29
  • Hashing as 1:1 relation between a key and a value can be a source of confusion. E.g. a Java hashmap works by hashing, and offers unique link between keys and values. However, under the hood the hash value only selects a bucket, within the bucket there is an additional search for the actual items, since hashing has non-zero probability of resulting the same key for different values. – Gee Bee Apr 04 '16 at 12:36
  • @GeeBee hasing and identifying are not incompatible. Hashing is used to get the right "bucket" of data and then one of several methods is used to handle the collision; Such has chaining or mod probing. https://en.wikipedia.org/wiki/Hash_table – Christian Bongiorno Apr 04 '16 at 17:38
  • @Christian, this is exactly what I meant. I am sorry if I was not able to express it right. So, hashing alone is not a way for PK, because it can not elimiate collisions. You have to handle the collisions yourselves as I wrote in the answer. The idea of "hoping for the best", and "collision is a low proability" is too optimistic. – Gee Bee Apr 04 '16 at 23:01