There is a process which exports some CSV files to GCS and puts the current datetime into the path, e.g.:
gs://my_gcs_bucket/path/to/export/<current_timestamp>/exported_file_1.csv
gs://my_gcs_bucket/path/to/export/<current_timestamp>/exported_file_2.csv
I want to create an external table for these files, cast its columns, and merge it with other tables.
The problem is that I can't figure out how to implement an Airflow DAG that handles the changing timestamps.
I can create an external table by specifying the path (with the current timestamp), but that external table will not pick up tomorrow's exports, even though tomorrow I will still need the same project_name.dataset.tablename for processing.
from airflow.contrib.operators.bigquery_operator import BigQueryCreateExternalTableOperator

CreateExternalBqTable = BigQueryCreateExternalTableOperator(
    task_id="CreateExternalBqTable",
    field_delimiter='|',
    destination_project_dataset_table=f'{project_name}.{dataset}.{tablename}',
    bucket=BUCKET_URI,
    source_objects=['path/to/export/2021-12-12-12-12-12/exported_file_*'],
    schema_fields=generate_custom_struct(),
)
If I try to run the same table-creation task again tomorrow, it will fail because the external table already exists. I could delete the existing external table and then recreate it so the creation task doesn't fail, but if the table has already been deleted, the delete task itself fails:
from airflow.providers.google.cloud.operators.bigquery import BigQueryDeleteTableOperator

DeleteExternalBqTable = BigQueryDeleteTableOperator(
    task_id="DeleteExternalBqTable",
    deletion_dataset_table=f'{project_name}.{dataset}.{tablename}',
)
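For completeness, this is roughly how the two tasks end up wired together in the daily DAG under the delete-then-recreate approach (the DAG id, schedule, and start date below are just placeholders):

from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id='daily_export_processing',  # placeholder name
    schedule_interval='@daily',
    start_date=datetime(2021, 12, 1),
    catchup=False,
)

# DeleteExternalBqTable and CreateExternalBqTable are defined as above,
# each with dag=dag. The delete must run first so the creation does not
# fail on an existing table, but the delete itself fails when the table
# is missing (e.g. on the very first run).
DeleteExternalBqTable >> CreateExternalBqTable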
What is a good pattern here?
Should I always delete and recreate the external table just because of the new timestamps?
That seems like a very bad and error-prone pattern to me.
Or could I change the path under an existing external table somehow?
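For example, I imagine something like this inside a PythonOperator, using the google-cloud-bigquery client to repoint the existing external table at the new folder (I haven't verified this end to end; the function name and the way the new timestamped prefix is obtained are placeholders):

from google.cloud import bigquery

def repoint_external_table(new_prefix):
    """Point the existing external table at a new timestamped GCS folder."""
    client = bigquery.Client()
    # project_name, dataset and tablename are the same values used above
    table = client.get_table(f'{project_name}.{dataset}.{tablename}')

    external_config = table.external_data_configuration
    external_config.source_uris = [
        f'gs://my_gcs_bucket/{new_prefix}/exported_file_*'
    ]
    table.external_data_configuration = external_config

    # Only the external data configuration is updated and the table keeps
    # its name, so downstream queries would not have to change.
    client.update_table(table, ['external_data_configuration'])

Would that be a reasonable way to keep the table name stable while the underlying path changes every day?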
I would be much happier if I could separate the initialization and deletion of the external table from the daily runs instead of deleting and recreating it every day. I plan to initialize these tables only once, move the CSV files away once processing is done, and keep the external tables empty until the next run.
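Moving the files away after processing seems to be the straightforward part; on Airflow 1.x I would expect something like this to work (the archive path and task id are placeholders):

from airflow.contrib.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator

# Move (not copy) the processed CSVs into an archive folder so the external
# table matches no files until the next export arrives.
ArchiveProcessedCsvFiles = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='ArchiveProcessedCsvFiles',
    source_bucket='my_gcs_bucket',
    source_object='path/to/export/*',
    destination_bucket='my_gcs_bucket',
    destination_object='path/to/archive/',
    move_object=True,
)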
(I need a working solution for Airflow 1.x)