
On sites like SO, I'm sure it's absolutely necessary to store as much aggregated data as possible to avoid performing all those complex queries/calculations on every page load. For instance, storing a running tally of the vote count for each question/answer, or storing the number of answers for each question, or the number of times a question has been viewed so that these queries don't need to be performed as often.

But does doing this go against db normalization, or any other standards/best-practices? And what is the best way to do this, e.g., should every table have another table for aggregated data, should it be stored in the same table it represents, when should the aggregated data be updated?

Thanks

BDuelz

3 Answers


Storing aggregated data is not itself a violation of any Normal Form. Normalization is concerned only with redundancies due to functional dependencies, multi-valued dependencies and join dependencies. It doesn't deal with any other kinds of redundancy.

nvogel
  • This only answers half of the question. It doesn't speak to best practice; it just references normalization. Also, [others argue](http://database-programmer.blogspot.com/2008/04/denormalization-patterns.html) that storing aggregates does violate third normal form. – Riley Major Apr 28 '14 at 17:38
  • Normal forms deal only with dependencies inside a single relation, i.e., [functional dependencies](https://en.wikipedia.org/wiki/Functional_dependency). They ignore inter-record dependencies, which are the ones described in the article linked by @RileyMajor. So although inter-record/inter-relation dependencies cause redundancy, such as aggregated data calculated from another relation, traditional normal forms don't care about them. For more info, see this [simple guide](http://www.bkent.net/Doc/simple5.htm#label6) and [this paper](https://personal.comp.nus.edu.sg/~lingtw/papers/tods81.LTK.pdf). – fjsj May 19 '18 at 19:30

The phrase to remember is "Normalize till it hurts, Denormalize till it works"

It means: normalise all your domain relationships (to at least Third Normal Form (3NF)). If measurement shows a performance problem, then investigate (and measure) whether denormalisation will provide a performance benefit.

So, yes: storing aggregated data 'goes against' normalisation.

There is no 'one best way' to denormalise; it depends what you are doing with the data.

Denormalisation should be treated the same way as premature optimisation: don't do it unless you have measured a performance problem.

Mitch Wheat
  • 295,962
  • 43
  • 465
  • 541
  • Are you saying that storing aggregated data violates 3NF? Why would it? – nvogel Jul 19 '11 at 09:20
  • 1
    @dportas: an aggregate is made up from other data: that means you are essentially holding the same data in two places. It may not contradict the definition of 3NF.... – Mitch Wheat Jul 19 '11 at 10:13

Too much normalization can hurt performance, so in the real world you have to find your balance.

I've handled a situation like this in two ways.

1) Using DB2, I used an MQT (Materialized Query Table), which works like a view except that its result is stored and refreshed on a schedule you choose, e.g. every 5 minutes. That table then held the count values.
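For readers without DB2, the MQT idea can be imitated by hand: a small summary table rebuilt from the base data on a schedule rather than on every read. A rough sketch with sqlite3 (the `answers`/`answer_counts` tables are invented for illustration):

```python
import sqlite3

# Stand-in for a materialized query table: answer_counts is rebuilt
# from the base answers table by a scheduled job, not on every read.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE answers (question_id INTEGER);
CREATE TABLE answer_counts (question_id INTEGER PRIMARY KEY, n INTEGER);
""")
conn.executemany("INSERT INTO answers VALUES (?)", [(1,), (1,), (2,)])

def refresh_counts(conn):
    # In practice, run by a scheduler (cron, daemon) every few minutes.
    conn.execute("DELETE FROM answer_counts")
    conn.execute("""
        INSERT INTO answer_counts
        SELECT question_id, COUNT(*) FROM answers GROUP BY question_id
    """)

refresh_counts(conn)
n = conn.execute("SELECT n FROM answer_counts WHERE question_id = 1").fetchone()[0]
print(n)  # 2
```

Reads between refreshes see slightly stale counts, which is exactly the trade being made.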

2) In the software package itself, I stored information like that as a system-wide variable. In Apache you can set a system-wide variable and refresh it every 5 minutes. The value is then only approximately current, but you're only running your "count(*)" query once every five minutes. You can have a daemon refresh it or have it driven by page requests.

I used a wrapper class to do it, so it's been a while, but I think in PHP it was as simple as: `$_SERVER['report_page_count'] = array('timeout'=>1234569783, 'count'=>15);`

Nonetheless, however you store that single value, it saves you from recomputing it on every request.
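The timeout-plus-count idea above can be sketched in a few lines of Python (the `run_count_query` function and the 15 it returns are placeholders for the real query):

```python
import time

TTL = 300  # cache lifetime in seconds: five minutes

# Cached value and its expiry time, analogous to the
# array('timeout'=>..., 'count'=>...) pair in the PHP version.
_cache = {"count": None, "expires": 0.0}

def run_count_query():
    # Placeholder for the expensive SELECT COUNT(*) against the database.
    return 15

def report_page_count():
    now = time.time()
    if now >= _cache["expires"]:
        # Cache is stale (or empty): recompute and push the expiry forward.
        _cache["count"] = run_count_query()
        _cache["expires"] = now + TTL
    return _cache["count"]

print(report_page_count())  # 15
```

Between refreshes the value can be up to five minutes stale, which is usually fine for counters like page views.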

Alan B. Dee