
The "third normal form" of database design asks you to remove functional dependencies. It seeks to eliminate redundancy, removing from a table any attributes (fields) that can be calculated from the other fields. For instance, when you make a reference to another entity, you only store its key. You don't store replicas of attributes from these referred entities, because that would mean having to update them every time you change the referred entity.

Another situation is an attribute such as height. You want to know a person's height, but your application may need it in different units: meters, feet, astronomical units. You don't store all of these values; you remove the calculated fields, keep only one of them (meters, of course), and calculate the converted values "on the fly", when you need them.

You also don't store something like age; you calculate it from the date of birth. In this case the fact that age changes with time also plays a role: if you don't do this, your data will soon be incorrect, unless you keep updating it constantly.
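To make that concrete, here is a minimal sketch of the kind of design I mean (table and column names are made up, PostgreSQL-flavored SQL): only the base attributes are stored, and the calculated values are produced on the fly by a plain view.

    -- Only base attributes are stored in the table itself.
    CREATE TABLE person (
        person_id     INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        height_m      NUMERIC(4, 2),       -- metres only; other units are derived
        date_of_birth DATE NOT NULL        -- age is derived, never stored
    );

    -- Derived values are computed on the fly by a plain view.
    CREATE VIEW person_derived AS
    SELECT person_id,
           name,
           height_m,
           height_m * 3.28084 AS height_ft,                          -- unit conversion
           date_part('year', age(current_date, date_of_birth)) AS age_years
    FROM person;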

Now suppose you have a social network that displays the astrological sign of each user. For each user we display his or her sign, which is calculated from the date and time of birth, and also from the astrological tradition of the user's choice (e.g. Western or Chinese). Calculating someone's sign is a fairly specialized and involved calculation, but it is nevertheless something you can just call from an astrological library. How would you design this database? Would you remove this functionally dependent attribute and enforce the "3NF" rule with an iron fist? Or would you calculate the user's sign once (or every time she updates her date of birth), store it in the database, and forget about the possibility of calculating a sign anywhere else in your system? And how do you see the potential benefit of being able to tweak a user's sign if she insists that her sign should be different from the one your algorithm calculates?
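One compromise I can imagine looks roughly like this (just a sketch, all names made up): keep the inputs normalized, but also keep a stored sign that a batch job or trigger fills in from the astrological library, and that can be overwritten when a user disputes the computed value.

    CREATE TABLE user_profile (
        user_id        INTEGER PRIMARY KEY,
        birth_datetime TIMESTAMP NOT NULL,  -- input to the sign calculation
        tradition      TEXT NOT NULL,       -- e.g. 'western', 'chinese'
        sign           TEXT,                -- cached result, possibly tweaked by hand
        sign_is_manual BOOLEAN NOT NULL DEFAULT FALSE  -- protects manual tweaks from recalculation
    );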

Now suppose a new application. You are building a system for the Army that chooses possible occupations for new conscripts. Part of your system is a huge bad-ass pattern recognition machine, developed at ORNL, that tells you what occupations will be allowed for the conscripts, based on a whole lot of data that comes from your database. This pattern recognition method looks at each person's height, date of birth, medical records and school transcript, and it also looks at a long list of answers the person gave to one of two questionnaires. It also takes into consideration evaluation questionnaires filled in by higher-ranking officers, and the analysis looks at all of the conscripts from each year at the same time, to come up with a bunch of parameters for, say, a feed-forward neural network. This neural network is simpler than the whole of the pattern recognition system, but it is still a pretty involved calculation, not just a temperature unit conversion. Nevertheless, you can see it as a "black box" that can tell you the fate of each of your conscripts when you pull his record from the database.

At a specific date in the Army's calendar, the analysis is made, the ANN parameters are found, and now you can run the black box and tell each conscript what he will be doing for the next year or two. It's a pretty important thing: you will be deciding people's occupations, sending some people to operate ovens in the kitchen and others to drive tanks. Everyone will be logging in to the Army's network to check their occupations and see the result on their personal webpages.

Now, the little black box is there, and it outputs this important value based on the attributes pulled from your database. The attributes and parameters will probably never change, and the calculation of the conscript's destiny will probably always be correct, always the same value from the first day. But would you really leave that outside the database, just to follow the "remove calculated attributes" rule?

This is not just some calculation of age, astrological sign or temperature conversion. The pattern recognition system is indeed a way to calculate an attribute. But would you really like to keep that inside your system forever, re-calculating this hugely important thing over and over again? Or would you consider it better to calculate everything once and forget that this magical black box exists? I mean, just run all of this crazy pattern recognition code once, some day, then take the results and record them in the database. When someone logs in to see the output of his analysis, make the query a boring data-fetching procedure, and leave all the pattern recognition "excitement" for another time.
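In other words, something roughly like this (names invented for illustration): the black box writes its verdicts into an ordinary table once, and the website only ever reads from that table.

    CREATE TABLE conscript_assignment (
        conscript_id INTEGER NOT NULL,      -- key of the conscript in the main table
        analysis_run DATE    NOT NULL,      -- the date the ANN parameters were fitted
        occupation   TEXT    NOT NULL,      -- the black box's output, stored once
        PRIMARY KEY (conscript_id, analysis_run)
    );

    -- The personal webpage becomes a boring data fetch:
    -- SELECT occupation FROM conscript_assignment
    -- WHERE conscript_id = ? ORDER BY analysis_run DESC LIMIT 1;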

Another, more palpable situation: you run a web store like Amazon. You have a pattern recognition method that recommends books to people and is fed from a user's record in your DB. Would you keep running the thing every time you need the current recommendations, or would you treat the pattern recognition box as something separate from the rest of the system, just feeding its results into the database, from where they can be read by other programs? Wouldn't it be nice to have some control, for example, to make sure you are not switching recommendations at an awkward moment?... Don't worry too much about computational burden; assume excellent memory and computational resources.
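Again just a sketch with made-up names: the recommender writes whole batches, and a flag says which batch is "live", so the store never switches recommendations at an awkward moment.

    CREATE TABLE recommendation_batch (
        batch_id    INTEGER PRIMARY KEY,
        computed_at TIMESTAMP NOT NULL,
        is_live     BOOLEAN   NOT NULL DEFAULT FALSE  -- flipped only when we decide to publish
    );

    CREATE TABLE recommendation (
        batch_id INTEGER NOT NULL REFERENCES recommendation_batch (batch_id),
        user_id  INTEGER NOT NULL,
        book_id  INTEGER NOT NULL,
        rank     INTEGER NOT NULL,        -- position in the user's recommendation list
        PRIMARY KEY (batch_id, user_id, rank)
    );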

TL;DR -- Do you think the rule of removing calculated attributes should never be broken, or would you say it is OK to store in the database the results of very important attributes that are calculated by very complex and delicate pattern recognition methods? Aren't there situations where we should just pretend that we cannot perform the calculation "on demand", and simply keep these results recorded there, since they won't be updated anyway?

dividebyzero

1 Answer


Deliberate, judicious denormalization is "allowed" for caching the results of an expensive calculation. Converting units or calculating age would be examples of inexpensive operations that should probably not be cached, but the other examples you cite look appropriate.

Depending on the DBMS, you may be able to leave the tables normalized and implement caching "on top" of them:

  • Some DBMSes support materialized views - like a regular VIEW, but persisted so it doesn't need to be recalculated every time it's queried.
  • Some DBMSes that support calculated fields (a.k.a. computed columns) also support persisting them (one example is the PERSISTED keyword under MS SQL Server). A sketch of both options follows below.
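A rough sketch of both (the base table and the expensive function are placeholders for whatever your schema and black box actually look like):

    -- PostgreSQL-style materialized view: the defining query runs once, its
    -- result is stored, and REFRESH MATERIALIZED VIEW recomputes it on demand.
    -- expensive_scoring_function() stands in for your black-box calculation.
    CREATE MATERIALIZED VIEW conscript_occupation AS
    SELECT conscript_id,
           expensive_scoring_function(conscript_id) AS occupation
    FROM conscript;

    -- MS SQL Server persisted computed column: the derived value is stored with
    -- the row and kept up to date by the engine (the expression must be deterministic).
    CREATE TABLE person (
        person_id     INT PRIMARY KEY,
        date_of_birth DATE NOT NULL,
        birth_year    AS YEAR(date_of_birth) PERSISTED
    );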
Branko Dimitrijevic