
I have several code/value/synonym lists (ICD codes, among others) that include multiple validity periods, aggregated into versions (one per year).

So I could choose the fully normalized approach, with the following structure:

VERSIONS(id INT PRIMARY KEY, name VARCHAR)
CODES(id INT PRIMARY KEY, code VARCHAR)
VALUES(id INT PRIMARY KEY, text VARCHAR)

CODEVALUES(code_id INT FOREIGN KEY CODES.id, 
    value_id INT FOREIGN KEY VALUES.id, 
    version_id INT FOREIGN KEY VERSIONS.id,
    synonym_nr INT) 
    with PK(code_id, value_id, version_id)

This way I can end up with up to 14 records for a code/value that hasn't changed in the last 14 years. With >14,000 codes and up to 20 synonyms each, that adds up to >2,000,000 records in CODEVALUES.
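For reference, here is a minimal DDL sketch of that normalized layout (SQL Server syntax assumed; the column lengths are just placeholders, and VALUES has to be bracket-quoted because it is a reserved word):

CREATE TABLE VERSIONS (id INT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE CODES    (id INT PRIMARY KEY, code VARCHAR(20));
CREATE TABLE [VALUES] (id INT PRIMARY KEY, text VARCHAR(255));

CREATE TABLE CODEVALUES (
    code_id    INT NOT NULL REFERENCES CODES(id),
    value_id   INT NOT NULL REFERENCES [VALUES](id),
    version_id INT NOT NULL REFERENCES VERSIONS(id),
    synonym_nr INT NOT NULL,
    PRIMARY KEY (code_id, value_id, version_id)
);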

An alternative would be to use an aggregated table, like

CODES(code VARCHAR, value VARCHAR, synonym_nr INT, min_version INT, max_version INT)

No FKs. For every combination of code/value/synonym_nr there's only one record.
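Sketched as DDL (I call it CODES_FLAT here only to keep it apart from the normalized CODES table; the key shown is just one possibility):

CREATE TABLE CODES_FLAT (
    code        VARCHAR(20)  NOT NULL,
    value       VARCHAR(255) NOT NULL,
    synonym_nr  INT          NOT NULL,
    min_version INT          NOT NULL,
    max_version INT          NOT NULL,
    PRIMARY KEY (code, synonym_nr, min_version)
);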

I know about normalization, but I'm trying to reduce development and administration complexity: I need one OR/M entity for every SQL table, including its relations, and since I have dozens of such code lists, a factor of 4 in the number of classes is significant.

I wonder if there are performance differences between those alternatives.

UPDATE:

The queries on these lists are of the kind where I look up a certain code in a specific version and want the default value for that code (synonym_nr = 0). As these queries are often part of larger queries, there may be several 10k up to 100k such code lookups per query transaction. With approach #1 I have at least two joins, and the DB has to hold a mapping record for every version (redundancy for code/value). Approach #2 instead defines a valid version range, which I have to query via

WHERE version >= min_version AND version <= max_version

So it's joins and more records (index efficiency?) versus range comparisons in the query constraints. Would there be a significant performance difference?
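To make that concrete, the two lookup shapes would be roughly as follows (@code and @version are hypothetical parameters; table names follow the sketches above):

-- Approach #1: normalized, two joins
SELECT v.text
FROM CODEVALUES cv
JOIN CODES    c ON c.id = cv.code_id
JOIN [VALUES] v ON v.id = cv.value_id
WHERE c.code = @code
  AND cv.version_id = @version
  AND cv.synonym_nr = 0;

-- Approach #2: aggregated, range predicate
SELECT value
FROM CODES_FLAT
WHERE code = @code
  AND synonym_nr = 0
  AND @version BETWEEN min_version AND max_version;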

Andreas H.
  • Do you actually have a question? As for the situation you describe, 2 million records is not particularly large and most queries you want could probably be optimized using indexes. – Gordon Linoff Mar 25 '15 at 13:36
  • I would advise you against making non-standard design decisions just to make it easier to code. It will save very little time up front and cost LOTS more in the long run. Ease of development should never be a consideration in how you design your database; it should be designed with the data in mind. Oddly enough, when the DDL is done correctly it makes the DML a lot easier as a bonus. – Sean Lange Mar 25 '15 at 13:46
  • With your latest update it sounds as if you're about to commit one of the most criminal offenses in database design: making non-standard design decisions to deal with a performance issue that has not yet occurred. This is known as premature optimization and can cause all sorts of pain in the long run. Check out this article about the topic: http://www.sqlservercentral.com/articles/Performance+Tuning/115825/ – Sean Lange Mar 25 '15 at 14:45

1 Answer


I'm completely with @SeanLange on this one:

It will save very little time up front and cost LOTS more in the long run.

Model properly now and you won't have to troubleshoot everyone else's queries later on.

Consider using smaller datatypes for your Version, Code and Value PKs, e.g. TINYINT or SMALLINT instead of INT where appropriate. Consider a view for your aggregated table and point your ORM at the view if you want to.
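As an illustration of that last point, a view over the normalized tables could present the flattened shape to the ORM. This is only a sketch (table names follow your question, vCodeValues is a made-up name) and it assumes the version ids of an unchanged code/value form a contiguous range:

CREATE VIEW vCodeValues AS
SELECT c.code,
       v.text AS value,
       cv.synonym_nr,
       MIN(cv.version_id) AS min_version,
       MAX(cv.version_id) AS max_version
FROM CODEVALUES cv
JOIN CODES    c ON c.id = cv.code_id
JOIN [VALUES] v ON v.id = cv.value_id
GROUP BY c.code, v.text, cv.synonym_nr;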

Alternatively, consider a different modelling approach. If the rate of change is low then using a 'from' and 'to' approach for the version numbers might be more compact.

Based on the way you've written your question I'm guessing you're at least reasonably competent with SQL Server. Try both approaches and look at the query plans for 'typical' queries to see how SQL Server handles each one.
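For example, wrapping the test queries like this (standard SQL Server options, nothing specific to your schema) lets you compare logical reads and CPU time alongside the actual plans:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run the approach #1 lookup here, then the approach #2 lookup,
-- and compare the Messages output and the actual execution plans

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;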

Rhys Jones
  • Thanks. I tried your suggestion, aggregated at least the CODEVALUES table into a 'from'/'to' approach and built a view for querying. Unfortunately, measuring the total time needed for the queries still showed the 1-table approach being nearly twice as fast as the normalized alternative. I attribute this to the fact that the additional joins (normalized approach) produce some extra sorts and merge joins in the execution plan. So... finally I will go for a normalized table setup that holds the original data and automatically produces a denormalized, aggregated (data-mart-like) table for querying. – Andreas H. Mar 27 '15 at 15:02
  • Anyway, since your answer gives real recommendations and nudged me in new directions, I'll mark it as the solution. – Andreas H. Mar 27 '15 at 15:09