0

Looking for the most efficient way to check for nulls and have a desired output for a report. This is done in a Hadoop environment.

For example, Database contains:

FirstName LastName State
John {null} OH
Jane Smith {null}
Pikachu Pokemon HI

The output from the query would be:

RuleNum ColName DefectCount TableCount DateofData
Rule001 FirstName 0 50 2023-01-01
Rule002 LastName 1 50 2023-01-01
Rule003 State 1 50 2023-01-01

The embarassingly clunky and inefficient query that that I use, which works- but it takes a LONG time to get results is(and we check thousands of columns)- and i'm unsure as to how to get the table count included efficiently:

SELECT
CAST('Rule001' AS VARCHAR(255)) AS RuleNum,
CAST('FirstName' AS VARCHAR(255)) AS ColName,
A.DefectCount AS DefectCount,
CAST(('date_add(current_date(),-1)) AS VARCHAR(255) AS DateofData
FROM
(SELECT COUNT(1) AS DefectCount
FROM DbName.TableName
WHERE FirstName IS NULL

UNION ALL

SELECT
CAST('Rule002' AS VARCHAR(255)) AS RuleNum,
CAST('LastName' AS VARCHAR(255)) AS ColName,
A.DefectCount AS DefectCount,
CAST(('date_add(current_date(),-1)) AS VARCHAR(255) AS DateofData
FROM
(SELECT COUNT(1) AS DefectCount
FROM DbName.TableName
WHERE LastName IS NULL

UNION ALL

SELECT
CAST('Rule003' AS VARCHAR(255)) AS RuleNum,
CAST('State' AS VARCHAR(255)) AS ColName,
A.DefectCount AS DefectCount,
CAST(('date_add(current_date(),-1)) AS VARCHAR(255) AS DateofData
FROM
(SELECT COUNT(1) AS DefectCount
FROM DbName.TableName
WHERE State IS NULL
OneCricketeer
  • 179,855
  • 19
  • 132
  • 245
Supernova
  • 29
  • 6
  • 2
    I've removed the conflictin (R)DBMS tags. You will need to [edit] your question to (re)tag the product you are *actually* using. – Thom A Jan 29 '23 at 16:22
  • for the output you wanr there isn't a better way, but why not have all columns defined as NIOT NULL, so that you don't need to count – nbk Jan 29 '23 at 16:40
  • Yeah, I assumed it was SQL Server until `CAST(('date_add(current_date(),-1)) AS VARCHAR(255) AS DateofData`. So not so sure anymore. – null_pointer Jan 29 '23 at 16:54
  • I would build a dynamic query to get a row of null counts for nullable columns of the table and unpivot it. – Serg Jan 29 '23 at 17:53
  • The date field isn't as important- the syntax can be adjusted... or even put in at a later time in Excel or another BI tool that we're researching(Alteryx). @Serg any tips or resources as to where I may find such a dynamic query, please? Thanks! – Supernova Jan 30 '23 at 15:29
  • Creating a dynamic query varies by product. What is your DBMS exactly? – Serg Jan 30 '23 at 16:06
  • @Serg Hadoop. our database admins mentioned to me that Hadoop does not use a RDBMS system- everything is a backend with distributed files, and the front end is Hive. Does this make sense? Thanks! – Supernova Jan 30 '23 at 17:14
  • Hive does use a RDBMS for the metastore – OneCricketeer Jan 31 '23 at 15:04

0 Answers0