-1

I am trying to figure out how to write a query to find the % of missingness (NULL values) in every column in a table in SQL Server, so the idea is to have a query return two columns:

  1. the first column is the names of every column in the table
  2. the second column should be the % missing in that column.

I keep trying to apply a case statement to every column but it isn't working, any help is greatly appreciated!

Thom A
  • 88,727
  • 11
  • 45
  • 75
Sam Cannon
  • 27
  • 1
  • 6
  • 3
    What is *missingness*? Do you mean nulls? Post sample data and expected results to clarify. – forpas Aug 06 '20 at 18:56
  • yes, I have edited the question to reflect that I mean NULL VALUES, simply records in the table that have NO DATA – Sam Cannon Aug 06 '20 at 18:57
  • What is a "cell"? Cells don't exist in SQL Server. Those are something you find in Excel Spreadsheets. – Thom A Aug 06 '20 at 18:58
  • When I say 'cell' i just mean a single record in a column – Sam Cannon Aug 06 '20 at 19:00
  • Take a look at my answer [here](https://stackoverflow.com/a/60654652/2029983). Changing the count to be `NULL` values would be quite easy if you take the time to understand the solution. – Thom A Aug 06 '20 at 19:01
  • Are you okay with writing out every column you want to check, or are you looking for a dynamic solution? – Andy Aug 06 '20 at 19:01
  • SQL Server doesn't have "records" either. Tables, for example, are made up of columns and rows. There are no "records" or "cells". – Thom A Aug 06 '20 at 19:02
  • Thanks Larnu, a row in a column – Sam Cannon Aug 06 '20 at 19:02
  • Actually, this [answer](https://stackoverflow.com/a/60187850/2029983) might a better candidate. Depends if you're after the entire database, or a single table. – Thom A Aug 06 '20 at 19:03
  • Oso, I have 400 columns so I would prefer not to have to write out every column name :) – Sam Cannon Aug 06 '20 at 19:03
  • Adapting @Larnu's solution is the way to go, then, but be aware that this is pretty advanced stuff and dynamic SQL is a minefield for beginners. – Andy Aug 06 '20 at 19:03
  • 1
    Side note: A table with over 400 columns sounds like you have some denormalisation issues you probably should be addressing. – Thom A Aug 06 '20 at 19:05
  • Sounds good, I'll jump into that solution then, thanks – Sam Cannon Aug 06 '20 at 19:05

1 Answers1

0

Ok, taking the first answer I linked you can make a query like this:

USE Sandbox;
GO

DECLARE @Schema sysname, @Table sysname, @IncDistinct bit = 0;
SET @Schema = 'dbo';
SET @Table = 'rCTEvsTally';

DECLARE @SQL nvarchar(MAX)

SET @SQL = N'WITH Counts AS (' + NCHAR(13) + NCHAR(10) + 
           N'    SELECT @Schema AS SchemaName,' + NCHAR(13) + NCHAR(10) +
           N'           @Table AS TableName,' +
           STUFF((SELECT N',' + NCHAR(13) + NCHAR(10) + 
                           N'           COUNT(' + CASE WHEN C.DATA_TYPE IN ('text','image') THEN 'NULL' ELSE N'CASE WHEN ' + QUOTENAME(C.COLUMN_NAME) + N' IS NULL THEN 1 END' END + N') AS ' + QUOTENAME(COLUMN_NAME + N'_NULLs') + N',' + NCHAR(13) + NCHAR(10) +
                           N'           COUNT(' + CASE WHEN C.DATA_TYPE IN ('text','image') THEN '1' ELSE QUOTENAME(C.COLUMN_NAME) END + N') AS ' + QUOTENAME(COLUMN_NAME) + N',' + NCHAR(13) + NCHAR(10) +
                           N'           COUNT(DISTINCT ' + CASE WHEN C.DATA_TYPE IN ('text','image') THEN '1' ELSE QUOTENAME(C.COLUMN_NAME) END + N') AS ' + QUOTENAME(COLUMN_NAME + N'_Distinct')
                   FROM INFORMATION_SCHEMA.COLUMNS C
                   WHERE C.TABLE_SCHEMA = @Schema
                       AND C.TABLE_NAME = @Table
                       --AND C.DATA_TYPE NOT IN ('text','image')
                   ORDER BY C.ORDINAL_POSITION
                   FOR XML PATH(N''),TYPE).value('.','nvarchar(MAX)'),1,14,N'') + NCHAR(13) + NCHAR(10) + 
           N'    FROM ' + QUOTENAME(@Schema) + N'.' + QUOTENAME(@Table) + N')' + NCHAR(13) + NCHAR(10) + 
           N'SELECT V.OrdinalPosition,' + NCHAR(13) + NCHAR(10) + 
           N'       V.ColumnName,' + NCHAR(13) + NCHAR(10) + 
           N'       V.NonNullCount,' + NCHAR(13) + NCHAR(10) + 
           CASE WHEN @IncDistinct = 1 THEN N'       V.DistinctCount,' + NCHAR(13) + NCHAR(10) ELSE N'' END +
           N'       V.NullCount,' + NCHAR(13) + NCHAR(10) + 
           N'       ISC.DATA_TYPE + ISNULL(NULLIF(DT.S,''(*)''),'''') AS Datatype,' + NCHAR(13) + NCHAR(10) +
           N'       K.KeyType' + NCHAR(13) + NCHAR(10) +
           N'FROM Counts C' + NCHAR(13) + NCHAR(10) + 
           N'     CROSS APPLY(VALUES' + STUFF((SELECT N',' + NCHAR(13) + NCHAR(10) + 
                                                       N'                       (' + CONVERT(varchar(4),C.ORDINAL_POSITION) +N',N' + QUOTENAME(C.COLUMN_NAME,'''') + N',C.' + QUOTENAME(C.COLUMN_NAME) + N',C.' + QUOTENAME(C.COLUMN_NAME + N'_Distinct') + N',C.' + QUOTENAME(C.COLUMN_NAME + N'_NULLs') + N')'
                                               FROM INFORMATION_SCHEMA.COLUMNS C
                                               WHERE C.TABLE_NAME = @Table
                                                 --AND C.DATA_TYPE NOT IN ('text','image')
                                               ORDER BY C.ORDINAL_POSITION
                                               FOR XML PATH(N''),TYPE).value('.','nvarchar(MAX)'),1,26,N'') + N')V(OrdinalPosition,ColumnName,NonNullCount,DistinctCount,NullCount)' + NCHAR(13) + NCHAR(10) +
           N'     JOIN INFORMATION_SCHEMA.COLUMNS ISC ON C.SchemaName = ISC.TABLE_SCHEMA' + NCHAR(13) + NCHAR(10) +
           N'                                        AND C.TableName = ISC.TABLE_NAME' + NCHAR(13) + NCHAR(10) +
           N'                                        AND V.ColumnName = ISC.COLUMN_NAME' + NCHAR(13) + NCHAR(10) + 
           N'     CROSS APPLY (VALUES(''('' + STUFF(CONCAT('','' + CASE ISC.CHARACTER_MAXIMUM_LENGTH WHEN -1 THEN ''MAX'' ELSE CONVERT(varchar(4),ISC.CHARACTER_MAXIMUM_LENGTH) END,' + NCHAR(13) + NCHAR(10)+
           N'                                            '','' + CASE WHEN ISC.DATA_TYPE NOT LIKE ''%int'' THEN CONVERT(varchar(4),ISC.NUMERIC_PRECISION) END,' + NCHAR(13) + NCHAR(10) +
           N'                                            '','' + CASE WHEN ISC.DATA_TYPE NOT LIKE ''%int'' THEN CONVERT(varchar(4),ISC.NUMERIC_SCALE) END,' + NCHAR(13) + NCHAR(10) +
           N'                                            '','' + CASE WHEN ISC.DATA_TYPE NOT IN (''datetime'',''smalldatetime'') THEN CONVERT(varchar(4),ISC.DATETIME_PRECISION) END),1,1,'''') + '')'')) DT(S)' + NCHAR(13) + NCHAR(10) +
           N'     OUTER APPLY(SELECT TC.CONSTRAINT_TYPE AS KeyType ' + NCHAR(13) + NCHAR(10) + 
           N'                 FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS TC' + NCHAR(13) + NCHAR(10) + 
           N'                      JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE KCU ON TC.TABLE_SCHEMA = KCU.TABLE_SCHEMA' + NCHAR(13) + NCHAR(10) + 
           N'                                                                  AND TC.TABLE_NAME = KCU.TABLE_NAME' + NCHAR(13) + NCHAR(10) + 
           N'                                                                  AND TC.CONSTRAINT_NAME = KCU.CONSTRAINT_NAME' + NCHAR(13) + NCHAR(10) + 
           N'                 WHERE KCU.COLUMN_NAME = V.ColumnName' + NCHAR(13) + NCHAR(10) + 
           N'                   AND TC.TABLE_SCHEMA = ISC.TABLE_SCHEMA' + NCHAR(13) + NCHAR(10) + 
           N'                   AND TC.TABLE_NAME = ISC.TABLE_NAME) K' + NCHAR(13) + NCHAR(10) +
           N'ORDER BY V.OrdinalPosition';

PRINT @SQL; --you will need to use the SELECT here if @SQL is over 4,000 characters
--SELECT @SQL;
EXEC sp_executesql @SQL, N'@Schema sysname,@Table sysname',@Schema = @Schema, @Table = @Table;

This exposes the various counts for the parametrised table.

Rather than butchering that solution, you can then INSERT that dataset into a (temporary) table, and then do the work.

--All prior code goes above apart from the call to sp_executesql
CREATE TABLE #ColumnCounts (OrdinalPosition int,
                            ColumnName sysname,
                            NonNullCount decimal(15,0), --Decimal as we want a percentage later
                            NullCount decimal(15,0), --Decimal as we want a percentage later
                            Datatype sysname,
                            KeyType nvarchar(50));
INSERT INTO #ColumnCounts
EXEC sys.sp_executesql @SQL, N'@Schema sysname,@Table sysname',@Schema = @Schema, @Table = @Table;

SELECT CC.ColumnName,
       NullCount / (NonNullCount + NullCount) AS ["Missingness"]
FROM #ColumnCounts CC;

DROP TABLE #ColumnCounts;

Like I state on many of my complex Dynamic SQL solutions, this is not entry level. If you do not understand it you should likely not be using it, or you should be taking the time to understand it. If you want to start understanding it, I suggest first looking at the generated SQL and then working backwards.

Thom A
  • 88,727
  • 11
  • 45
  • 75
  • You're welcome, @SamCannon , but like I said, you need to take the time to understand the solution here before you implement it. – Thom A Aug 06 '20 at 19:46