Should I use a code in my lookup table

Question

I am woking on an Orable database and I am adding a couple of lookup tables.

The general question is should the lookup table contain a code and a description and the code be the FK back to the main table, or should the lookup table contain only a description and that be the FK back to the main table.

I am arguing for the code/description pair. I feel that if I have type = Contractor and code = CN a stored proc should say where type='CN' rather than only having the type=Contractor and no code and saying this in the stored proc: where type='Contractor' Because what if I want to display: General Contractor to the user rather than Contractor. I would then have to change the stored proc. I feel that I should not have to do this. (changing a stored proc requires recompilation in dev, migration to test, retesting by the clients, and migrating a prod which requires going through a change control proccess which involves a two week waiting period; whereas modifying a record in a table does not require any of this)

My data modeler wants to use only the description. His main argument is that it would require an unnessesary join.

Which way should it be done? And if if should be done the code/description way how do i convince the data modeler?

Thanks!

type_cd    type_dsc
CN         Contractor
IN         Inspector

Thanks everyone for the really good answers, there were so many godo ones. I tried to pick the answer that I thought would be the most helpfull to other people reading this; that is the point, right? Thanks everyone! Oh, and the final solution for me ended up being a Name/Description Pair. Similar to the Code/Description option except the Code is not 2 chars but 30. This was not my idea, but our new "company standard". Thanks! — kralco626, Feb 11 '11 at 11:26

score 13 · Accepted Answer · answered Feb 10 '11 at 13:25

Summarizing all of the answers, I think there are four alternatives for the lookup table:

Alternative 1:
• Description (primary key, a longer varchar2 column)

Alternative 2:
• Code (primary key, a short varchar2 column)
• Description (not null, a longer varchar2 column)

Alternative 3:
• Id (a meaningless primary key, an integer value derived from a sequence)
• Description (not null, a longer varchar2 column)

Alternative 4:
• Id (a meaningless primary key, an integer value derived from a sequence)
• Code (unique key, a short varchar2 column)
• Description (not null, a longer varchar2 column)

The primary key column will be in the main table with a foreign key constraint on top.

Some characteristics per alternative:

Alternative 1:
• No join is required when querying the main table
• Clear meaning when doing ad-hoc queries on main table
• Requires more storage for main table
• Index on main table will be much bigger than in other alternatives
• Updating a Description value means maintenance trouble and possibly application downtime.

Alternative 2:
• Join is required when you want to retrieve the description value
• Join in not required if you want to filter on certain lookup values: you can use the code value for that.
• Pretty clear meaning when doing ad-hoc queries on main table
• Minimal additional storage requirements for main table
• Index on main table will be small.
• Updating a Description value is easy, however code is usually an abbreviation from the description. When updating a Description value, the code can become confusing.

Alternative 3:
• Join is required when you want to retrieve the description value
• When filtering on certain lookup values, you'd have to use the Description values in your queries as the Id's are meaningless.
• Meaning is not clear when doing ad-hoc queries on main table
• Minimal additional storage requirements for main table
• Index on main table will be small.
• Updating a Description value is easy and doesn't cause confusion as with Code values

Alternative 4:
• Join is required when you want to retrieve the description value
• Join is required when filtering on certain lookup values, you would use the Code value in the lookup table.
• Meaning is not clear when doing ad-hoc queries on main table
• Minimal additional storage requirements for main table
• Index on main table will be small
• Updating a Description value is easy and you can also very easily update the Code value as well to make it resemble the Description value. You may have to revisit some of your code when doing this, though.

Personal opinion:

I would look at how I plan to use the main table and the lookup table. Which queries will be important and have to run efficiently? Will the values ever change?

My personal choice would be alternative 2 or 4. I'd use alternative 2 if I was absolutely sure that the code value can never change. And this is rare. Country codes change, social security numbers change. Currency codes change, etcetera. So, most of the time, I'd choose alternative 4. I'd not be so concerned for an extra join, especially because the lookup table is a small table.

But: choose an alternative that suits your requirements.

Please feel free to edit the text when you know some more characteristics of an alternative.

Regards,
Rob.

+1 for a really great write-up. Not only did you give alternative options, you also gave the most important characteristics/consequences of each one. I wish this answer gets accepted. — Ronnis, Feb 10 '11 at 22:08
Maybe I got this wrong, but on Alternative 3, updating Descriptions might be easy and not break referential integrity, but it would break/invalidate **all** your filtering queries, as they rely on the Description value and not on the ID (similar to updating Code on Alternative 4) — MestreLion, Oct 23 '18 at 11:54

Joe Stefanelli · Answer 2 · 2011-02-09T14:47:37.780

5

Code/Description. That code value is (I assume) going to be a smaller, more efficient integer. Furthermore, you don't want to be in a situation where you need to update all your foreign keys just because a text description changes at some time in the future.

EDIT: Based on the sample code you just added, I would encourage you to make your code value an integer value rather than a string like 'CN', 'IN'. You want your key value to be agnostic to any "meaning" associated to the description. 'CN' still implies 'Contractor' and if/when that description changes to be 'External Resource' then 'CN' is going to be misleading.

edited Feb 09 '11 at 14:47

answered Feb 09 '11 at 14:38

Joe Stefanelli

132,803
19
237
235

You make a good point about the integer, it is just less meaningfull in the sql stored proc. What do you think of his argument? I edited my question. – kralco626 Feb 09 '11 at 14:47
3

@kralco626: I think I was editing my answer at the same time you were posting your comment. I think the "extra join" argument holds little, if any, water. With proper indexing on the "code" value, the join should not be a problem at all. The benefits far outweigh the costs. – Joe Stefanelli Feb 09 '11 at 14:49
@Joe - I find your new argument about the integer intreging. As long as the integer is not autogen that should work. However the there is still no meaning in the code. Also the long description is much more likley to change, say from contractor to general contractor, than the logical idea, contractor to external resource. But you make a very good point. Although all the rest of our lookup tables have char(2) keys. – kralco626 Feb 09 '11 at 14:50
@Joe - That's my opinion also on the joins. Although I think is goign to be hard for "little old developer me" to win the efficiency argument against "big powerful data modeler guy" even though I do have a significant amount of experience with database design. Not sure how i am supposed to convince him that it's just as efficent. I also don't want to start any political issues. All my database work does have to go through this guy, so i don't wanna piss him off too much. But I deff don't want a desc as a key. – kralco626 Feb 09 '11 at 14:53
1

@kralco626: If the data modeler won't budge and assuming the data modeler is not your boss, I would convince your boss that your approach is correct and let him "encourage" the data modeler to see things your way. After all, that's why your boss gets paid the big bucks! :-) Of course, you'll have to feel your own way through your particular corporate culture. – Joe Stefanelli Feb 09 '11 at 14:58
@Joe - haha, ya I was hoping not to involve my boss on this one. But I am NOT going to accept a desc as a key without a fight. What do you think about the fact that a integer is meaningless when used in the stored proc? `type_cd = 1` dosn't mean much. Do you see this to be an issue? With proper indexing I woudl think efficency would not be an issue; although your argument of changing Contractor to External Party still stands. – kralco626 Feb 09 '11 at 15:01
@Joe - also, if I may ask another question, there is a second lookup table for status. I feel as though I have less of an argument for this because status is less likley to change. Should I also argue to have a integer key on that table as well? – kralco626 Feb 09 '11 at 15:03
@kralco626: For question 1, I you could comment your code `type_cd =1 /* Contractor */`. Yes, I realize that based on my earlier arguments the comment could become outdated, but that's relatively minor. For question 2, I would always use the code/description pair, no matter how "static" you think the descriptions are. An integer code is still going to yield smaller columns/indexes, etc. – Joe Stefanelli Feb 09 '11 at 15:09
@Joe - so I talked to the actual data modeler that is in charge and he was very understanding of the need to seperate the PK that your joining on and then actual description your showing the user. However, he said that their new standard is NOT to use a Code/Description but rather a Name/Description. He would want Name to be of size varchar(20) or 30 or something like that. He stipilates that it makes the joins less efficent, but he said it is easier to read the data... So what do you think of that??? – kralco626 Feb 09 '11 at 16:34
@kralco626: I'm not thrilled with it. Why take up 20 or 30 characters of space when an integer will suffice. It will make the indexes (which you *will* create for more efficient joins) unnecessarily bloated. I'd never have it in one of my designs but, if this is the established standard in your shop, then perhaps you need to go with the flow on this one. You've got to make the call on whether or not continuing the fight is worthwhile. – Joe Stefanelli Feb 09 '11 at 16:40
1

@Joe - Ya I'm going to mark this one down as a successfull compremise. It is a varchar so it should only take up the nessesary space(of course this is still more space than in integer or a char(2)). Also, doesn't a char(2)(2 bytes) take up half the amount of an integer (4 bytes)? So wouldn't a char(2) key actually be more efficent than a integer key? – kralco626 Feb 09 '11 at 17:01
@kralco626: that's why I usually choose a CHAR(4) code, as it takes the same _storage_ as INT. But their _efficiency_ is **not** the same! An auto-increment INT has contiguous values, while CHAR's values are "sparse", which can have an impact on index performance as they are usually made of hash trees optimized for contiguous values – MestreLion Oct 23 '18 at 12:16

score 4 · Answer 3 · answered Feb 09 '11 at 15:43

4

Well, this would depend on how "standard" those codes are.

Consider a lookup table like this one:

Code  Description
------------------
USD   United States Dollar
GBP   Pound Sterling
AUD   Australian Dollar
EUR   Euro

For this, I would use char(3) for the Code and make it the primary key. Your codes seem to be char(2) -- neat, small; smaller than an integer.

So my guess would be to use the Code as the PK in the lookup table and the "main table" would have Code as a FK to the lookup table.

And if your codes are not very standard and subject to change, integer is preferred.

answered Feb 09 '11 at 15:43

Damir Sudarevic

21,891
3
47
71

interesting. what do you say about the arguments that a integer key is more efficent than a CHAR(2) – kralco626 Feb 09 '11 at 15:46
@kralco626, `char(2)` is two bytes. Lookup tables do not have many rows. Do not see how would an integer (4 byte) be more efficient in this particular case. – Damir Sudarevic Feb 09 '11 at 15:51
@Damir - good point. I can't image that comparing the integers is that much better, if at all, than conparing char(2). Plus with an index... likley it does not matter. – kralco626 Feb 09 '11 at 15:58
i still would not use the 'code'. instead include that abbreviation as another field on the lookup table. then your UI can use that for query and display, but you do not confuse your data storage. – Randy Feb 09 '11 at 17:03
@Randy, and than you would need PK on the ID (integer) and an unique constraint on the `Code`. So, one extra constraint -- usually with a matching index. – Damir Sudarevic Feb 09 '11 at 17:09
I don't think one byte per row will make a measurable difference. Anyway Oracle stores numbers datatype as strings of numbers (each byte stores two numbers). Numbers up to 100 take at most 2 bytes (one byte for the figures, one byte for the exponent). You can use the DUMP function to see the byte representation of the different datatypes. – Vincent Malgrat Feb 09 '11 at 17:17
1

+1 Avoid the mania for using integer codes for everything, especially on these small lookup tables. In many cases the code itself will be self-explanatory and you won't need to join. Remember, the fastest JOIN is the one you don't even do. – Ken Downs Feb 09 '11 at 20:11
But what about the index on the foreign key of tables referencing this lookup table? Will it affect performance when we use string as primary key? I'm asking about the general case, not just this specific example. – DDMC Jun 14 '23 at 14:36

score 4 · Answer 4 · answered Feb 09 '11 at 22:48

There are existing standards and encoding schemes for a lot of things, that were thought out be people who are smarter than me and had a lot more time to think about it. For example, the ISO standard covers Sex Codes (iso 5218), country codes (iso 3166), language codes (iso 639), currency codes (iso 4217) etcetera. I bought Joe Celko's Data, Measurements and Standards in SQL last year, and it really surprised me just how many officially maintained ready-made standards and encoding schemes there are out there.

Ok fine, so once in a while some country abandon their bananas in favour of EUR/USD, and now you have to rewrite your entire application? No, you'd have to spend a few hours writing a script to merge/split whatever codes was changed. Big deal. Why don't you fix a few bugs during that same release while your're at it?

Personally, I use short character codes for almost everything I have to write code against, or when I need to assign behaviour depending to some "type code". The code is so intimately tied to the typecode, so why make it harder than it has to? The resulting code is easier to read, and it executes faster because I need less joins. For everything else (basically all user-generated), I use integer surrogates.

I have "only" worked with databases for 11 years, but I haven't seen many cases when the "name changed" so significantly that the code became misleading. A type code of "contractor", can't change into "Human resource manager" or "Vice president". That is a new code. It might split into "internal/external resource", but that would require code changes as well, in which case I don't see a problem with adding a few hours of data conversion to the project budget?

Finally, at some point you have to commit to a value that you put in your code. You can use whatever value you want, but it still means the same thing.

I've seen all of the following:

where type = 1 /* contractor */ 
  vs

int type_code = configfile.lookup("sqlcodes.contractor");
...
where type = :type_code
  vs

 from sometable
 join contract_types using(type_id)
where contract_types.type_name = 'Contractor';

...but I still don't see the benefit over just:

where type = 'CN'

The point I'm trying to make is: When we spend 80 hours on development, how the hell can 4 hours of database activities NOT fit into the project budget?

I do see a problem with "adding a few hours of data conversion to the project budget", because it's not just about those few hours of programming. It might also mean updating a billion row table, which can effectively take the application offline. And it's easy to avoid, so I wouldn't take this route. — Rob van Wijk, Feb 10 '11 at 09:08
100% uptime comes at a price. Many of the typical "quality attributes" like high-availability, security, performance, flexibility, scalability etc, are often are mutually exclusive. And just how many tables in a typical database have billions of rows? I'd argue that would be something of a special case. I'd even consider it a failure not to recognize it as such. And why should a special case dictate the design of the remaining 99% of the database? — Ronnis, Feb 10 '11 at 12:19

score 2 · Answer 5 · answered Feb 09 '11 at 14:38

2

use a numeric id value and a description. Store the id in the main table as the FK.

strings are lousy FK values, and basic normalization will tell your data modeler that you want the flexibility of changing the string once in the lookup table and not have to change it wherever it is referenced.

answered Feb 09 '11 at 14:38

Randy

16,480
1
37
55

You make a good point about the integer, it is just less meaningfull in the sql stored proc. What do you think of his argument? I edited my question. – kralco626 Feb 09 '11 at 14:45
You shoulod be able to find a ton of references here on SO about using integers as FK and the refuting of the invalid argument about the cost of a join. - mostly that boils down to 'thats what databases do - and they are good at it.' As far as being meaningful in the stored proc, you can always add comments to what you think the id references, at least those wont break when you do regular system maintenance on the data. – Randy Feb 09 '11 at 17:01

score 2 · Answer 6 · answered Feb 10 '11 at 05:04

Use a mnemonic (e.g. CN for Contractor, etc.) that is never changed. Let the UI show the description. A small code table will probably fit in one or two blocks and will therefore usually be found in the cache so lookups will be cheap.

Most importantly, future developers and people (like me) who come later and have to map this data to other systems will thank you, because it means that 80% of the time you can just query the table and understand it intuitively.

When I see a table like this, I just want to scream:

ADDRESS_ID
HOUSE_NUMBER
STREET_NAME
STREET_TYPE_ID
LOCALITY_ID

SELECT * FROM addresses WHERE street_type_id = 10053;

Codes never have to change (they're internal to the system, the end users shouldn't see them). Descriptions sometimes change, but usually not to a great degree. The occasions when a description changes so much that the mnemonic no longer makes sense, in my experience, are very rare.

score 1 · Answer 7 · answered Feb 09 '11 at 17:44

look, depending on key distribution - for most small lookup tables any queries joining to it are going to do a full table scan against the lookup and hash-join the key anyway - so numeric versus character is probably a complete non-issue for performance of the join.

The question really is - do you need the join?

I mean - if you are storing the lookup to be used as the display value by the UI, then yes use a lookup and join to it so you can update the display value easily.

If, on the other hand, you are using resource files for a multilingual app and you are using the returned code as the lookup, AND it is a small set of codes that is not expected to change (Gender_Code = 'M'ale or 'F'emale or 'U'nknown for example) - then make the codes meaningful, use a check constraint on the field to control the values, and don't even bother with a lookup table because you know them by the code and the UI will figure out how to display them.

score 0 · Answer 8 · answered Feb 09 '11 at 20:06

My suggestion is to use an int ID and a char/varchar description.

Use the ID in your queries and only link to the description when you need to display the description.

Do not worry that the ID will not look like the description. This is the way it is supposed to work. You want a non-significant ID so that no one will guess the meaning of 'CH' or 'EX', etc. Add comments to the code explaining what the ID is supposed to mean.

You want to be able to change the description at any time without breaking your code. You do NOT want to have to fix all the codes when they change.

Also, you can add one or more groups to the descriptions table. If you have several contractor types you can add a group column that indicates the type. Then you can link to the group description table and return all rows where the group is Contractor. Of course this group should have a lookup table with an ID and a description so you can change the displayed names of the lookup groups.

Tell the data modeler that you want to put the data in related tables. That is why it is called a relational database.

Should I use a code in my lookup table

8 Answers8