I am working in SQL Server 2008. I have been tasked with writing a stored procedure to do some data validations on external data before we move it into our star schema data warehouse environment. One type of test requested is domain integrity / reference lookup from our external fact data tables to our dimension tables. To do this, I use the following technique:
SELECT
some_column
FROM some_fact_table
LEFT JOIN some_dimension_table
ON
some_fact_table.some_column = some_dimension_table.lookup_column
WHERE
some_fact_table.some_column IS NOT NULL
AND
some_dimension_table.lookup_column IS NULL
The SELECT clause will match the column definition for an errors table that I will eventually move the output into via SSIS. So, the SELECT clause actually looks like:
SELECT
primary_key,
'some_column' AS Offending_Column,
'not found in lookup' AS Error_Message,
some_column AS Offending_Value
But, because the fact tables are very large, we want to minimize the number of times that we have to select from it. Hence, I have just 1 query for each fact table to check each column in question, which looks something like:
SELECT
primary_key,
'col1|col2|col3' AS Potentially_Offending_Columns,
'not found in lookup|not found in lookup|not found in lookup' AS Error_Messages,
col1 + '|' + col2 + '|' + col3 AS Potentially_Offending_Values
FROM fact_table
LEFT JOIN dim_table1
ON
fact_table.col1 = dim_table1.lookup_column
LEFT JOIN dim_table2
ON
fact_table.col2 = dim_table2.lookup_column
LEFT JOIN dim_table3
ON
fact_table.col2 = dim_table3.lookup_column
WHERE
dim_table1.lookup_column IS NULL
OR
dim_table2.lookup_column IS NULL
OR
dim_table3.lookup_column IS NULL
This has some problems with it. (1) If any of the source column rows is null, then the string concatenation in Offending_Values will result in NULL. If I wrap each column with ISNULL (and swap the nulls for something like an empty string), then I won't be able to tell if the test failed because of a true empty string in the source or if it was swapped for an empty string. (2) If just one of the columns fail in the lookup, then the error message will still read 'not found in lookup|not found in lookup|not found in lookup'
, i.e., I can't tell which of the columns actually failed. (3) The Potentially_offending_Columns column in the output will always be static, which means I can't tell if any of the columns failed just by looking at it.
So, in effect, I am having some design problems with my errors table. Is there a standard way of outputting to an errors table in this situation? Or, if not, what do I need to fix to make the output readable and useful?