There is a process which exports some CSV files to GCS and puts the current datetime into the path, e.g.:
gs://my_gcs_bucket/path/to/export/<current_timestamp>/exported_file_1.csv
gs://my_gcs_bucket/path/to/export/<current_timestamp>/exported_file_2.csv
I want to create an external table for these files, cast its columns, and merge it with other tables.
The problem is that I can't figure out how to implement an Airflow DAG that handles the changing timestamps.
I can create an external table by specifying the path (with the current timestamp), but that external table will not pick up tomorrow's exports, even though tomorrow I will still need the same project_name.dataset.tablename for processing.
from airflow.contrib.operators.bigquery_operator import BigQueryCreateExternalTableOperator

CreateExternalBqTable = BigQueryCreateExternalTableOperator(
    task_id="CreateExternalBqTable",
    field_delimiter='|',
    destination_project_dataset_table=f'{project_name}.{dataset}.{tablename}',
    bucket=BUCKET_URI,
    source_objects=['path/to/export/2021-12-12-12-12-12/exported_file_*'],
    schema_fields=generate_custom_struct(),
)
If I try to run the same table-creation task again tomorrow, it will fail because the external table already exists. I could delete the existing external table and then recreate it so the creation task doesn't fail, but if the table has already been deleted, the delete task itself fails:
from airflow.providers.google.cloud.operators.bigquery import BigQueryDeleteTableOperator

DeleteExternalBqTable = BigQueryDeleteTableOperator(
    task_id="DeleteExternalBqTable",
    deletion_dataset_table=f'{project_name}.{dataset}.{tablename}',
)
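For completeness, this is roughly how the two tasks end up wired together in the daily DAG under the delete-then-recreate approach (the DAG id, schedule, and start date below are just placeholders):

from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id='daily_export_processing',  # placeholder name
    schedule_interval='@daily',
    start_date=datetime(2021, 12, 1),
    catchup=False,
)

# DeleteExternalBqTable and CreateExternalBqTable are defined as above,
# each with dag=dag. The delete must run first so the creation does not
# fail on an existing table, but the delete itself fails when the table
# is missing (e.g. on the very first run).
DeleteExternalBqTable >> CreateExternalBqTable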
What is a good pattern here?
Should I always delete and recreate the external table just because of the new timestamps?
That seems like a very bad and error-prone pattern to me.
Or could I change the path under an existing external table somehow?
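For example, I imagine something like this inside a PythonOperator, using the google-cloud-bigquery client to repoint the existing external table at the new folder (I haven't verified this end to end; the function name and the way the new timestamped prefix is obtained are placeholders):

from google.cloud import bigquery

def repoint_external_table(new_prefix):
    """Point the existing external table at a new timestamped GCS folder."""
    client = bigquery.Client()
    # project_name, dataset and tablename are the same values used above
    table = client.get_table(f'{project_name}.{dataset}.{tablename}')

    external_config = table.external_data_configuration
    external_config.source_uris = [
        f'gs://my_gcs_bucket/{new_prefix}/exported_file_*'
    ]
    table.external_data_configuration = external_config

    # Only the external data configuration is updated and the table keeps
    # its name, so downstream queries would not have to change.
    client.update_table(table, ['external_data_configuration'])

Would that be a reasonable way to keep the table name stable while the underlying path changes every day?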
I would be much happier if I could separate the initialization and deletion of the external table from the daily runs instead of deleting and recreating it every day. I plan to initialize these tables only once, move the CSV files away once processing is done, and keep the external tables empty until the next run.
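Moving the files away after processing seems to be the straightforward part; on Airflow 1.x I would expect something like this to work (the archive path and task id are placeholders):

from airflow.contrib.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator

# Move (not copy) the processed CSVs into an archive folder so the external
# table matches no files until the next export arrives.
ArchiveProcessedCsvFiles = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='ArchiveProcessedCsvFiles',
    source_bucket='my_gcs_bucket',
    source_object='path/to/export/*',
    destination_bucket='my_gcs_bucket',
    destination_object='path/to/archive/',
    move_object=True,
)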
(I need a working solution for Airflow 1.x)