
There is a debate between our ETL team and a Data Modeler on whether a table should be normalized or not, and I was hoping to get some perspective from the online community.

Currently the tables are set up as follows:

    MainTable               LookupTable
    PrimaryKey (PK)         Code (PK)
    Code (FK)               Name
    OtherColumns
  • Both tables are only being populated by a periodic file (from a 3rd party) through an ETL job
    • A single record in the file contains all attributes for both tables for a single row
  • The file populating these tables is a delta (only rows with some change in them are in the file)
    • One change to one attribute for one record (again, only by the 3rd party) results in the file containing all of the data for that record
  • The Domain Values for Code and Name are not known.

Question: Should LookupTable be denormalized into MainTable?

  • ETL team: Yes. With this setup, every row from the file first has to check the lookup table to see whether its FK is already there (and insert it if not), and only then add the MainTable row: more code, worse performance, and yes, slightly more space. Moreover, whenever the 3rd party changes a LookupTable.Name, the periodic file will contain every affected row, and we will still have to parse each of them. If everything is lumped into MainTable, each record becomes a simple update or insert.
  • Data Modeler: This is standard good database design.
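As a rough sketch of the row-by-row flow the ETL team describes, using SQLite and made-up sample rows for illustration (the `Staging`-free check-then-insert logic, sample values, and upsert semantics are assumptions, not the team's actual code):

```python
import sqlite3

# Illustrative schema matching the diagram above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LookupTable (Code TEXT PRIMARY KEY, Name TEXT);
    CREATE TABLE MainTable (
        PrimaryKey INTEGER PRIMARY KEY,
        Code TEXT REFERENCES LookupTable (Code),
        OtherColumns TEXT
    );
""")

# Each delta-file record carries all attributes for both tables.
file_rows = [
    (1, "US", "United States", "other data"),
    (2, "CA", "Canada", "more data"),
    (3, "US", "United States", "changed data"),  # Code already seen
]

for pk, code, name, other in file_rows:
    # Step 1: check the lookup table; insert the code if it is missing.
    found = conn.execute(
        "SELECT 1 FROM LookupTable WHERE Code = ?", (code,)
    ).fetchone()
    if found is None:
        conn.execute(
            "INSERT INTO LookupTable (Code, Name) VALUES (?, ?)", (code, name)
        )
    # Step 2: add (or replace) the MainTable row.
    conn.execute(
        "INSERT OR REPLACE INTO MainTable (PrimaryKey, Code, OtherColumns) "
        "VALUES (?, ?, ?)",
        (pk, code, other),
    )
```

Two statements plus a lookup check per file row, versus a single upsert per row in the denormalized design.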

Any thoughts?

  • Are these tables part of an OLTP or DW/BI system? Dimensional modeling is different from 3NF modeling. – Marek Grzenkowicz Jul 31 '13 at 06:31
  • These tables are a backup to a set of web services for pseudo 24x7 availability. The main source of data is provided live by web services. If/when the web services go down (either expected maintenance or another reason), these tables will provide the last known good data. They won't be hit often, but when they are, they will be hit hard. – user2210179 Jul 31 '13 at 13:18
  • The third-party web services are the ones going down for maintenance. Instead of getting a response from them, our calls will be mapped to this database. – user2210179 Aug 01 '13 at 22:30

1 Answer


Build prototypes. Make measurements.

You started with this, which your data modeler says is a standard good database design.

    MainTable               LookupTable
    PrimaryKey (PK)         Code (PK)
    Code (FK)               Name
    OtherColumns

He's right. But this, too, is a good database design.

    MainTable
    PrimaryKey (PK)
    Name
    OtherColumns

If all updates to these tables come only from the ETL job, you don't need to be terribly concerned about enforcing data integrity through foreign keys. The ETL job would add new names to the lookup table anyway, regardless of what their values happen to be. Data integrity depends mainly on the system the data is extracted from. (And the quality of the ETL job.)

With this setup, every row from the file first has to check the lookup table to see whether its FK is already there (and insert it if not), then add the MainTable row.

If they're doing row-by-row processing, hire new ETL guys. Seriously.
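For contrast, here is a minimal sketch of set-based processing, again using SQLite for illustration (the `Staging` table, sample rows, and `INSERT OR IGNORE`/`INSERT OR REPLACE` conflict handling are my assumptions; a production ETL tool would bulk-load the staging table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LookupTable (Code TEXT PRIMARY KEY, Name TEXT);
    CREATE TABLE MainTable (PrimaryKey INTEGER PRIMARY KEY,
                            Code TEXT, OtherColumns TEXT);
    CREATE TABLE Staging (PrimaryKey INTEGER, Code TEXT,
                          Name TEXT, OtherColumns TEXT);
""")

# Bulk-load the whole delta file into the staging table...
conn.executemany("INSERT INTO Staging VALUES (?, ?, ?, ?)", [
    (1, "US", "United States", "other data"),
    (2, "CA", "Canada", "more data"),
    (3, "US", "United States", "changed data"),
])

# ...then maintain both targets with two set-based statements,
# no per-row existence checks in application code.
conn.execute("""
    INSERT OR IGNORE INTO LookupTable (Code, Name)
    SELECT DISTINCT Code, Name FROM Staging
""")
conn.execute("""
    INSERT OR REPLACE INTO MainTable (PrimaryKey, Code, OtherColumns)
    SELECT PrimaryKey, Code, OtherColumns FROM Staging
""")
```

Two statements per file, not two statements per row.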

More Code, Worse Performance, and yes slightly more space.

They'll need a little more code to update two tables instead of one. How long does it take to write the SQL statements? How long to run them? (How long each way?)

Worse performance? Maybe, maybe not. If you use a fixed-width code, like an integer or char(3), updates to the codes won't affect the width of the row. And since the codes are shorter than the names, more rows might fit in a page. (It doesn't make sense to use a code that is longer than the name.) More rows per page usually means less I/O.
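A back-of-envelope rows-per-page estimate. Every number here is an assumption for illustration: an 8 KB page, 60 bytes for the other columns, a ~12-byte name versus a 3-byte code, and no per-row or per-page overhead:

```python
# Hypothetical figures, not measurements.
PAGE_BYTES = 8192   # assumed page size; varies by DBMS
OTHER_COLS = 60     # assumed width of the other columns
NAME_BYTES = 12     # ~11.4-char name, rounded up
CODE_BYTES = 3      # fixed char(3) code

rows_with_name = PAGE_BYTES // (OTHER_COLS + NAME_BYTES)
rows_with_code = PAGE_BYTES // (OTHER_COLS + CODE_BYTES)
print(rows_with_name, rows_with_code)  # 113 vs 130 rows per page
```

Under these assumptions the coded rows pack roughly 15% more rows per page; the narrower the other columns, the bigger the relative effect.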

Less space, surely. Because you're storing a short code instead of a long name in every row of "MainTable".

For example, the average length of a country name is about 11.4 characters. If you used 3-character ISO country codes, you'd save an average of 8.4 bytes per row in "MainTable". For 100 million rows, you save about 840 million bytes. The size of that lookup table is negligible, about 6k.
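The arithmetic above can be checked directly (the 11.4-character average and 100-million-row figure are taken from the text; the rounding is mine):

```python
# Space saved by storing a 3-char code instead of the full name.
avg_name_len = 11.4    # average country-name length, per the text
code_len = 3           # 3-character ISO country code
rows = 100_000_000

bytes_saved = round((avg_name_len - code_len) * rows)
print(bytes_saved)  # 840000000 -- about 840 million bytes
```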

And you don't usually need a join to get the full name; country codes are intended to be human-readable without expansion.

  • The data integrity is mostly for alerting us if our file potentially contains bad or missing data. The code is not human-readable by itself. – user2210179 Aug 01 '13 at 22:57