0

Regardless of table importance, is it generally good practice to hash every single primary key, or should the type of data being stored be considered in that decision?

[Just to clarify the question] I was mainly wondering in the context of database security: is it a good practice to encrypt the primary key using md5 hashing on strings [as in using md5 before inputing the primary keys to the DB table] vs using an auto increment ID in the tables for a database that has the potential to be very large. Im currently working on an app where the database was designed in a way that the primary key of every table was encrypted using md5 before being saved, and was wondering if this was in good practice, or just unnecessary.

peyssoa
  • 40
  • 5

2 Answers2

4

It is a bad idea. In addition to what @blispr points out, scaling is a serious problem. This also shows up when using UUIDs and GUIDs.

When a key (PRIMARY KEY or otherwise) is 'random' (as UUIDs, hashes, etc, are), the 'next' lookup (for INSERT or SELECT) will be at some 'random' place in the Index (and/or table). When the table is huge, this means that the necessary block is unlikely to be in cache. In the extreme (index size much larger than cache), the cache will be thrashed and you will need about one disk hit per read or write operation. On conventional drives that is something like 100 hits per second. This won't be enough for huge tables.

So, you read somewhere to "keep the PRIMARY KEY small? Do it when convenient; don't go out of your way. Take, for example, "country codes". There are fewer than 256 countries, so you might be tempted to use TINYINT UNSIGNED, which takes 1 byte. I argue for using the standard 2-letter codes and CHAR(2) CHARACTER SET ascii, which takes 2 bytes. Simpler, more readable, and not enough bigger to matter.

Edit

An AUTO_INCREMENT is often (but not always) better because it is "chronological". That is, 'old' entries have small ids and are at one end of the table/index; 'new' entries are at the other end. In many applications, most of the activity is with the 'new' entries, therefore they tend to be cached while the 'old' entries stay on disk, unmolested.

Regardless of whether the PRIMARY KEY for my row is 'RickJames' or 12345 or '827ccb0eea8a706c4c34a16891f84e7b', I don't see much difference in "security". Don't confuse "obscurity" and "security".

On the other hand, if my id is 12345, a hacker can easily assume that 12346 and 12347 are probably valid ids, and could try to fetch their info. If that is your concern, then continue to use 12345, but also have some random value (not derivable from 12345) as a secondary value for validating that the id is not hacked. Store that value in the db for testing; you don't need to index it. (Hence, it won't hit my previous comments.) Furthermore, ids, security codes, etc, are probably best passed through cookies, not urls.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Many good things to note, I updated my question above, to be a bit more specific – peyssoa Dec 23 '15 at 02:43
  • I added a bunch more. – Rick James Dec 23 '15 at 05:58
  • A terminology note: "encryption" usually implies the reverse action ("decryption") is possible. MD5, on the other hand, is "one way" or "trap door". That is, no one (except presumably the NSA) can decrypt 827ccb0eea8a706c4c34a16891f84e7b back into 12345. – Rick James Dec 23 '15 at 06:01
  • @RickJames From your previous comments with user "Peyssa", i realize its a bad idea to encrypt the promary keys. BUT, Is it a good idea however, to ever encrypt the sensitive fields of a database? And if so, do you have any links that may help me? Will give you +1 if you have info. Thanks. – Just a coder Dec 23 '15 at 08:12
  • Yes, encrypt sensitive fields. And more. By "more", I mean things like putting the fields in a separate table and giving that table tighter `GRANTs`. Maybe even move them to a different server. You will be in deep do-do if your laptop full of Social Security Numbers is stolen, even if they are encrypted. Note that the key for decryption is probably sitting in the application code! – Rick James Dec 23 '15 at 16:32
  • Credit card numbers are _very_ sensitive. Even birth dates are somewhat sensitive, since they are a common key to other information. – Rick James Dec 23 '15 at 16:35
0

As @juergen replied, a primary key , when specified on a relational table, has an associated unique index - this is how the DB engine enforces the uniqueness.

Under the hood, for example if you have a string PK, then the Database may already be hashing it for storage in the internal data structure. Even if not hashed by the DB Engine, you may be creating an CPU overhead, at the time of creation of the row.

Also, PKs in one table will be an FK in another referring / child table. You will need to propagate these hash-keys elsewhere. This also makes your data 'unreadable' in a way.

blispr
  • 883
  • 5
  • 10
  • Updated my question above to be a bit more specific... but you are essentially saying that it would be much more efficient to use a more simple primary key [like auto incrementing an integer] which will lead to better scalability? – peyssoa Dec 23 '15 at 03:01
  • With encryption, its a different angle to the problem. Security now would be your concern - why else would one encrypt. Encrypting the primary key also means that its a sensitive piece of data. (and the rest of the columns are not as sensitive, if they're not being encrypted). You should look at the business or functional justification for using encryption and hashing. For performance goals, having an integer PK that is naturally indexed is a standard setup. Too many special tweaks in your table design (such as hashing without a stated business need) are not advisable. – blispr Dec 24 '15 at 18:11