0

I am looking for solution to perform data profiling on bigquery table covering below statistics for each column in table. Some of the columns are ARRAY and STRUCT as given below.

I tried multiple ways to generate dynamic query to cover below scenarios but no luck.. I will greatly appreciate your help/inputs.

Metrics I want to calculate part of this solution are:

  • MIN VALUE
  • MAX VALUE
  • MIN LENGTH OF THE FIELD
  • MAX LENGTH OF THE FIELD
  • NO OF UNIQUE RECORDS FOR EACH FIELD
  • NO OF NULLS IN FIELD
  • NO OF NON NULL VALUES IN FIELD.
  • MIN DATE IN CASE OF DATE OR DATETIME FIELD
  • MAX DATE IN CASE OF DATE OR DATETIME FIELD

Sample Table Data:

Blockquote

Desired output enter image description here

rmesteves
  • 3,870
  • 7
  • 23

3 Answers3

2

This query returns all the columns from a table in a dataset. I excluded STRUCTS, since you only need value columns.

SELECT CONCAT('`', table_catalog, '.', table_schema, '.', table_name, '`') as table_name, field_path, data_type
FROM project.dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name = 'table_name'
  AND data_type NOT LIKE 'STRUCT%'

Using the columns table, we'll generate a SQL query to get all these columns. Here, I only added MIN, MAX and COUNT DISTINCT columns. However, you can add more of them by adding new lines to SELECT part.

SELECT 
  STRING_AGG(
    CONCAT(
      'SELECT "', field_path, '" as field_path, ',
        'CAST(MIN(', field_path, ') as string) as max, ',
        'CAST(MAX(', field_path, ') as string) as min ',
        'COUNT(DISTINCT ', field_path, ') as count_distinct ',
      'FROM ', table_name) ,
    ' UNION ALL \n'
  ) as query
FROM columns

At the end, we'll run this query using EXECUTE IMMEDIATE, since it's a string:

EXECUTE IMMEDIATE (
  query
)

To bring all these queries together, it looks like that:

EXECUTE IMMEDIATE (
  SELECT 
    STRING_AGG(
      CONCAT(
        'SELECT "', field_path, '" as field_path, ',
          'CAST(MIN(', field_path, ') as string) as max, ',
          'CAST(MAX(', field_path, ') as string) as min ',
          'COUNT(DISTINCT ', field_path, ') as count_distinct ',
        'FROM ', table_name) ,
      ' UNION ALL \n'
    ) as query
  FROM (
    SELECT CONCAT('`', table_catalog, '.', table_schema, '.', table_name, '`') as table_name, field_path, data_type
    FROM project.dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
    WHERE table_name = 'table_name'
      AND data_type NOT LIKE 'STRUCT%'
  )
)

PS: It only solves structs for now. Can you show me an example of your ARRAY columns?

Sabri Karagönen
  • 2,212
  • 1
  • 14
  • 28
  • Thanks Sabri.. This is one of the table that has Array columns. CREATE OR REPLACE TABLE `demo.Customer` ( addresses ARRAY>, first_name STRING, dob DATE, last_name STRING, id INT64 ) – Mallik Tiru Jun 08 '20 at 17:53
  • Another DDL with column depth as 2. CREATE OR REPLACE TABLE `demo.Enrollment` ( kafka STRUCT, account STRUCT>, address ARRAY>, legalAcceptance ARRAY>, isGuestCheckout BOOL, gaiyoshomenId STRING> ) – Mallik Tiru Jun 08 '20 at 18:55
  • Thanks @Sabri. I need all columns output including STRUCTS and ARRAY. I modified your query to include STRUCTS and ARRAY but it is returning below error. **Cannot access field address1 on a value with type ARRAY> at [1:89] ** . Query returning error is. `SELECT "account.billingProfile.address1" as field_path, CAST(MIN(account.billingProfile.address1) as string) as max, CAST(MAX(account.billingProfile.address1) as string) as min, COUNT(DISTINCT account.billingProfile.address1) as count_distinct FROM myproject.demo.Enrollment.` – Mallik Tiru Jun 10 '20 at 02:16
  • No, don't remove that condition from where statement. it only removes the parent column, not the child columns. For example, it removes address but keeps address.zip and address.state. For array, I'll work on them whenever I find time. It's more complicated then structs, that's why I postponed it for now. – Sabri Karagönen Jun 10 '20 at 07:29
  • Hi Sabri Karagonen : Let me know if you can provide solution. – Mallik Tiru Jul 02 '20 at 19:07
  • missing a comma after "as min" – Lan Si Apr 19 '21 at 05:23
0

I dont understand what you mean with Min Length and Max Length, but considering the provided data, you could do something like below.

This query have basically two steps:

  1. Create a temporary table with flat data using the WITH clause
  2. Calculate the metrics by running one query for each column and use UNION ALL to compose everything in a single table.

Query:

WITH
  t AS(
  SELECT
    first_name,
    dob,
    last_name,
    a.zip addresses_zip,
    a.state addresses_state,
    a.city addresses_city,
    a.numberOfYears addresses_numberOfYears,
    a.status addresses_status,
    a.phone.primarynumber addresses_phone_primarynumber,
    a.phone.secondary addresses_phone_secondary
  FROM
    <your-table> t,
    t.addresses a 
)

SELECT
  "first_name" AS column,
  COUNT(first_name) total_count,
  COUNT(DISTINCT first_name) total_distinct,
  SUM(
  IF
    (first_name IS NULL,
      1,
      0)) total_null,
  CAST(MIN(first_name) AS string) min_value,
  CAST(MAX(first_name) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "dob" AS column,
  COUNT(dob) total_count,
  COUNT(DISTINCT dob) total_distinct,
  SUM(
  IF
    (dob IS NULL,
      1,
      0)) total_null,
  CAST(MIN(dob) AS string) min_value,
  CAST(MAX(dob) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "last_name" AS column,
  COUNT(last_name) total_count,
  COUNT(DISTINCT last_name) total_distinct,
  SUM(
  IF
    (last_name IS NULL,
      1,
      0)) total_null,
  CAST(MIN(last_name) AS string) min_value,
  CAST(MAX(last_name) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.zip" AS column,
  COUNT(addresses_zip) total_count,
  COUNT(DISTINCT addresses_zip) total_distinct,
  SUM(
  IF
    (addresses_zip IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_zip) AS string) min_value,
  CAST(MAX(addresses_zip) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.state" AS column,
  COUNT(addresses_state) total_count,
  COUNT(DISTINCT addresses_state) total_distinct,
  SUM(
  IF
    (addresses_state IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_state) AS string) min_value,
  CAST(MAX(addresses_state) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.city" AS column,
  COUNT(addresses_city) total_count,
  COUNT(DISTINCT addresses_city) total_distinct,
  SUM(
  IF
    (addresses_city IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_city) AS string) min_value,
  CAST(MAX(addresses_city) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.numberOfYears" AS column,
  COUNT(addresses_numberOfYears) total_count,
  COUNT(DISTINCT addresses_numberOfYears) total_distinct,
  SUM(
  IF
    (addresses_numberOfYears IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_numberOfYears) AS string) min_value,
  CAST(MAX(addresses_numberOfYears) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.status" AS column,
  COUNT(addresses_status) total_count,
  COUNT(DISTINCT addresses_status) total_distinct,
  SUM(
  IF
    (addresses_status IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_status) AS string) min_value,
  CAST(MAX(addresses_status) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.phone.primarynumber" AS column,
  COUNT(addresses_phone_primarynumber) total_count,
  COUNT(DISTINCT addresses_phone_primarynumber) total_distinct,
  SUM(
  IF
    (addresses_phone_primarynumber IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_phone_primarynumber) AS string) min_value,
  CAST(MAX(addresses_phone_primarynumber) AS string) max_value
FROM
  t 

UNION ALL

SELECT
  "addresses.phone.secondary" AS column,
  COUNT(addresses_phone_secondary) total_count,
  COUNT(DISTINCT addresses_phone_secondary) total_distinct,
  SUM(
  IF
    (addresses_phone_secondary IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_phone_secondary) AS string) min_value,
  CAST(MAX(addresses_phone_secondary) AS string) max_value
FROM
  t
rmesteves
  • 3,870
  • 7
  • 23
  • I am looking for solution to build above queries dynamically by handling ARRAY columns because all ARRAY columns need to be unnest before accessing. Based on table provided in above comments let us know how to handle ARRAY columns dynamically. – Mallik Tiru Jun 09 '20 at 18:51
0

I highly recommend using a low-code no-code tool like Telmai for profiling entire tables in BigQuery within minutes.

All of the metrics you have mentioned are OOTB, so dont need to write any code and you and there is free tier so almost no cost to you.

https://telm-ai.webflow.io/data-profiling-and-observability