1

Goal

We're trying to produce fact and dimension tables that will be easy for anybody to use. Many modern BI systems promote exploration and experimentation and we want people of all skill levels to be successful.

Problem

Our data has tons of NULLable Foreign Key IDs. This causes unexpected behavior in almost every DB. In Snowflake (our data warehouse), JOINing on NULL results in a CROSS JOIN, which is very very bad.

Potential Approaches

Require extra care in every JOIN
This can be done two ways but they both are easy to mess up or miss...

  1. Use COALESCE(nullable, '--SOMETHING_INVALID--') in JOINs
  2. Check for NULL in the ON clause of JOINs

Replace NULL ids with a default
If we default to something valid, this will skew reporting in unknown ways. If we default to something invalid, we'd be breaking referential integrity in a second and hard-to-trace way.

SELECT
  customer_id,
  -- Some default "id" that wont cause a collision
  COALESCE(location_id, '9999999999') AS location_id,
  ...
FROM
  crm.customers

Add a row of all NULL values to every table
It's a bit heavy-handed but... the JOINs will always work.

The Question

All of these have drawbacks.

  • Is there something better/cleaner?
  • Does dbt have tools or utilities to help?
Baer
  • 3,690
  • 25
  • 24
  • Do you have nulls ids in dimension tables as well? – Radagast Dec 15 '21 at 01:05
  • Yes, the dimension may have a half dozen IDs in it that are all nullable. For example, a customer may have a null location_id and a null customer_category_id – Baer Dec 15 '21 at 03:14
  • [Design Tip #43: Dealing With Nulls In The Dimensional Model - Nulls as Fact Table Foreign Keysl](https://www.kimballgroup.com/2003/02/design-tip-43-dealing-with-nulls-in-the-dimensional-model/) and [Design Tip #128 Selecting Default Values for Nulls](https://www.kimballgroup.com/2010/10/design-tip-128-selecting-default-values-for-nulls/) – Lukasz Szozda Dec 15 '21 at 14:14
  • If keys of interest are all integers, I would leave the nulls in the fact tables as is for traceability reasons, but would impute nulls in dimension tables with a negative integer(-999 or something). Or vice versa if that's easier to implement. You'd have to make sure the ids in the fact tables can only be positive integers through some logic that gets applied upstream in the process. – Radagast Dec 15 '21 at 20:58

1 Answers1

0

I think a good approach would be to create views on top of each of the table. In those views you can take care of the cleaning of the data such as dealing with NULL values, and then only use the views for JOINing and whatever other downstream reporting / processing you need to do.

Edit following response to my comment, You could add a record into each of the dimensions (such as Location) for "Unspecified" and/or "Unknown" with e.g. Location_ID being -1 or 999999999 Then you need to update the tables containing the NULL values (referring to Location in this example) to those values you want (-1 or 999999999). Then you can set these values to be the default to deal with any new insertion/update on the tables using those keys. And then you can enforce referential integrity.

  • The question is not really about where to do the cleaning, it's a softer best practice on how to clean in such a way that the data will be easy for people to use. – Baer Dec 14 '21 at 23:40
  • The edited comment in the answer above is correct. Your fact table should never have null values in its keys and you need to enforce this in the process that loads the fact table – NickW Dec 16 '21 at 08:40