1

I have around 50 tables under a BigQuery project created by an ETL tool from different platforms. There is no primary key in the tables and few tables have columns over 100. Each platform has different fields and different data types and the team’s requirement is to create a one big master table of all the tables combined

I am manually accomplishing this task by listing all the columns of the tables in an Excel and null out fields which are not present in the table to form a query for each table and UNION ALL together to create a master table. I then scheduled it in BigQuery to refresh every day.

Ex:

Tbl A   Tbl B   Tbl C
char1   char1   char1
num1    num3    num1
char2   char2   char2
    char5   num2
    num2    num3
        char3
        char4

Manual Query:

With mast_tbl as (
select concat(cast(row_number() over (partition by date) as string), ' | ', ' TblA) as pk_client , 
char1,
num1,
char2,
null as num3,
cast(null as string) as char5,
null as num2,
cast(null as string) as char3,
cast(null as string) as char4

FROM `bigquery-project-XXXX.export_TblA.all _data_view`

UNION ALL

select concat(cast(row_number() over (partition by date) as string),' | ', ' TblB) as pk_client , 
char1,
null as num1,
char2,
num3,
char5,
num2,
cast(null as string) as char3,
cast(null as string) as char4

FROM `bigquery-project-XXXX.export_TblB.all _data_view`

select concat(cast(row_number() over (partition by date) as string), ' | ', ' TblC) as pk_client , 
char1,
num1,
char2,
num3,
cast(null as string) as char5,
num2,
char3,
char4

FROM `bigquery-project-XXXX.export_TblC.all _data_view`

)
Select * from mast_tbl

This works but the amount of manual work is huge especially if we are adding a new table or new column or removing a column this would make me change at each table query for UNION ALL to work. So, I was wondering if there is any way we can automate a script or any other way we should do this task.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
ABY
  • 393
  • 2
  • 11
  • 2
    *"the team’s requirement is to create a one big master table of all the tables combined"* I have to ask why? – Nick Sep 14 '19 at 08:29

1 Answers1

1

You can generate the query from INFORMATION_SCHEMA tables. The query itself will be much messier, but you can regenerate it easily.

This also requires that the tables all be in the same dataset so start by creating a view to each of the tables.

The query generation idea is:

with tables as (
      select 'vw_export_TblA' as table_name union all
      select 'vw_export_TblB' as table_name union all
      . . .
     ),
     columns as (
      select 'char1' as column_name, 'string' as type union all
      select 'num1' as column_name, 'int64' as type union all
      . . .
     ) 
select string_agg(table_query, ' union all ')  
from (select concat('select ',
                    string_agg(coalesce(ic.column_name, concat('cast(null as ', c.type, ')')), ', ')
                    ' from ',
                    table_name
                   ) as table_query
      from tables t cross join
           columns c left join
           `?.INFORMATION_SCHEMA.COLUMNS` ic
     on ic.table_name = t.table_name and
              ic.column_name = c.column_name
     group by t.table_name
    ) c;

This idea is NOT that the result is pretty. The idea is that you can create a table or view from the result. And that you can control the tables and columns by expanding the lists in the CTEs.

It will take you a bit of time to get this query just right for your environment -- longer than any one manual iteration. However, the next time you have to change a type or column, it will be much easier.

Note: You could also put this together in a language such as Python. Simply allowing newlines in strings makes the result much prettier.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786