Goal
We're trying to produce fact and dimension tables that will be easy for anybody to use. Many modern BI systems promote exploration and experimentation and we want people of all skill levels to be successful.
Problem
Our data has tons of NULL
able Foreign Key IDs. This causes unexpected behavior in almost every DB. In Snowflake (our data warehouse), JOIN
ing on NULL
results in a CROSS JOIN
, which is very very bad.
Potential Approaches
Require extra care in every JOIN
This can be done two ways but they both are easy to mess up or miss...
- Use
COALESCE(nullable, '--SOMETHING_INVALID--')
inJOIN
s - Check for
NULL
in theON
clause ofJOIN
s
Replace NULL
ids with a default
If we default to something valid, this will skew reporting in unknown ways. If we default to something invalid, we'd be breaking referential integrity in a second and hard-to-trace way.
SELECT
customer_id,
-- Some default "id" that wont cause a collision
COALESCE(location_id, '9999999999') AS location_id,
...
FROM
crm.customers
Add a row of all NULL
values to every table
It's a bit heavy-handed but... the JOIN
s will always work.
The Question
All of these have drawbacks.
- Is there something better/cleaner?
- Does dbt have tools or utilities to help?