0

Goal: run checks.yml on all Tables in Database, implicitly / dynamically (not naming 100s of Tables).

Following Soda's quick start, I've completed sections:

  • Install Soda Core
  • Connect Soda Core to a data source - configuration.yml

Now I'm following Write a check and run a scan - checks.yml.


Problem

However, the documentation only gives examples for checking one Table each.

4 Checks

  1. Sum of Tables (in Database)
  2. Sum of Columns (across all Tables, in Database)
  3. Sum of Tables' descriptions exist
  4. Sum of Columns' descriptions exist

Queries return a COUNT().


So far, checks.yml:

# checks for MY_DATABASE:
  sql_metrics:
  name: num_tables, num_columns
    sum_tables query: |
      SELECT COUNT(*)
      FROM information_schema.tables
      WHERE table_schema = '*';
    sum_columns query: |
      SELECT COUNT(*)
      FROM information_schema.columns
      WHERE table_name = '*';
    sum_tables_descriptions query: |
      -- SQL
    sum_columns_descriptions query: |
      -- SQL
DanielBell99
  • 896
  • 5
  • 25
  • 57
  • 1
    `'*'` has meaning as a wildcard only in [a few specific contexts](https://stackoverflow.com/a/73495536/20860). It is not a get-out-of-jail-free card you can use anytime you want to use a loop or you don't know a value. – Bill Karwin Sep 01 '22 at 16:21
  • what do you mean with descriptions? May be comments in postgres objects and columns? – Pepe N O Sep 06 '22 at 15:31
  • If you need to count all tables and columns then eliminate the where clause. – Meyssam Toluie Sep 11 '22 at 06:29

1 Answers1

1

Your checks file should look like

checks for MY_DATABASE:
  - sum_tables > 0:
      sum_tables query: |
        SELECT COUNT(*)
        FROM information_schema.tables
        WHERE table_schema ~~ '%'
  - sum_columns > 0:
      sum_columns query: |
        SELECT COUNT(*)
        FROM information_schema.columns            

Your checks must be based on conditions(to be checked or verified) or filters that are not available yet on documentation. ~~ means like and % is the wildcard. Although condition all and no condition gives same result, therefore where clause is not neccessary.

Soda Core 3.0.5
Scan summary:
2/2 checks PASSED: 
    MY_DATABASE in postgres
      sum_tables > 0 [PASSED]
      sum_columns > 0 [PASSED]
All is good. No failures. No warnings. No errors.

Or you could dynamically create checks files with this script, using a list of datasets(tables) with 'for each dataset T: ' clause, like:

import psycopg2
import time

#File name with timestamp
filePath = '.'
timestr = time.strftime("%Y-%m-%d-%-H%M%S")
fileName = 'checks-' + timestr + '.yml'

try:
    conn = psycopg2.connect("dbname=postgres user=postgres password=yourpassword host=localhost")
    cur = conn.cursor()
    cur.execute("SELECT '    - '||table_name||'\n' FROM information_schema.tables WHERE table_schema = 'public' order by 1 limit 5;")
    row = cur.fetchone()

    with open(fileName,'w') as f:
        line="for each dataset T:\n  datasets:\n"
        f.write(line)        
        while row is not None:            
            f.write(row[0])
            row = cur.fetchone()            
        cur.close()
        line="  checks:\n    - row_count > 0"
        f.write(line)
except (Exception, psycopg2.DatabaseError) as error:
    print(error)
finally:
    if conn is not None:
        conn.close()
Pepe N O
  • 1,678
  • 1
  • 7
  • 11