0

We are currently working on Data mart design. We are having many Foreign keys to dimension tables. We are thinking whether to allow NULL in Foreign key dimension fields or have -1 to represent NULL values.

Kimball suggests to keep default row for NULL values. http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/fact-table-null/

My lead suggests to keep NULL as NULL.

Will there be any performance impact for keeping NULL in Foreign key fields ?

Venkataraman R
  • 12,181
  • 2
  • 31
  • 58

2 Answers2

2

Kimball is right (as he usually is). Use a default value where you would use NULL.

Why? It ensures that joins to the dimensions will not "accidentally" filter rows. Trying to reconcile results from different queries eats up a lot of time. Ensuring that joins succeed is one method of reducing such discrepancies.

If you are not going to follow his advice, then store using NULL. A value such as -1 is particularly bad -- because it prevents the database from enforcing foreign key constraints.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

Another reason to avoid using NULL that Gordon hasn't covered: it's unclear what NULL means.

Sometimes you have a NULL in a data mart or data warehouse because something has gone wrong in the ETL or in a source system, leading to a NULL. Other times you have a NULL because that column doesn't apply to that particular row. Or in the case of something like an accumulating snapshot table, because that column has not been populated yet, as the process being reported on hasn't yet reached the point where that column will be populated.

Rather than a single default value I like to set up multiple; for instance, you can set up every dimension to have a row that indicates "Unknown" which you might use for missing values, and a row that indicates "N/A" for cases where the value does not apply. I tend to set these up with negative integers for keys (-1 is Unknown, -2 is N/A, etc.), as that allows me to use the same keys for these rows in every table. But as both Kimball and Gordon indicate, you should actually create those rows in your dimensions.

This makes it really easy to run data quality checks looking for cases where something has gone wrong. It means you can display some meaningful values in any reporting or analysis tools so people can filter out rows that haven't fully populated if they want to, or so your data stewards can look for problematic data via those tools. Or perhaps people might want to specifically look for those rows where one of the dimensions isn't applicable.

If you have a situation where data sometimes loads in the "wrong" order (i.e. a fact table gets populated, but relevant dimension members haven't been added a dimension yet), you can also use this to check for rows that need updating in your ETL and automate fixing the issue, without repeatedly trying to update those rows that do not need updating because they will always have a NULL.

And down the line when someone else takes over support of this data mart, they'll be really thankful when they don't have to spend a huge amount of time unpicking whether those NULLs or -1s indicate a problem.

Jo Douglass
  • 2,055
  • 1
  • 19
  • 30