I am using Cloud Composer to orchestrate ETL for files arriving in GCS and landing in BigQuery. A Cloud Function triggers the DAG when a file arrives and passes the file name/location to the DAG. My DAG currently has 2 tasks:
1) Use the DataflowPythonOperator to run a Dataflow job that reads the text data from GCS, transforms it, and loads it into BigQuery.
2) Move the file to a failure/success bucket depending on whether the job fails or succeeds.
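For reference, this is roughly how the file name reaches the DAG today: the Cloud Function triggers the DAG with a conf payload, and a task reads it from dag_run.conf. A minimal sketch, assuming the Cloud Function sends a key named `file_path` (that key name, the DAG ID, and the start date are placeholders, not my real setup):

```
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {"start_date": datetime(2018, 1, 1)}

# Externally triggered, so no schedule_interval.
dag = DAG("gcs_to_bq_etl", schedule_interval=None, default_args=default_args)


def get_file_path(**context):
    # The Cloud Function triggers the DAG with a conf payload such as
    # {"file_path": "gs://incoming-bucket/some_file.txt"}; the key name
    # "file_path" is an assumption for this sketch.
    file_path = context["dag_run"].conf["file_path"]
    # Returning it pushes it to XCom so downstream tasks can use it.
    return file_path


read_conf = PythonOperator(
    task_id="read_file_path",
    python_callable=get_file_path,
    provide_context=True,
    dag=dag,
)
```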
Each file has a file ID, which is a column in the BigQuery table. Occasionally a file will be edited once or twice (this is not a streaming scenario where that happens often), and I want to be able to delete the existing records for that file before loading the new version.
I have looked into the other Airflow operators, and I would like to add 2 tasks to my DAG before the Dataflow job runs:
- Get the file ID based on the file name (right now I have a BigQuery table mapping file name -> file ID, but I could also bring in a JSON file that serves as the map if that is easier).
- If the file ID is already present in the BigQuery table (the table the Dataflow job writes the transformed data to), delete those records, then run the Dataflow job so I have the most up-to-date information (see the sketch after this list). I know one option is to just add a timestamp and only use the most recent records, but because there could be a million records per file and I am only replacing one or two files a day at most (not 100), that seems like it could be messy and confusing.
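This is roughly what I was imagining for the two pre-Dataflow tasks, continuing the sketch above (same `dag` object) and using the BigQuery Python client inside PythonOperators; the table names `etl.file_id_map` / `etl.transformed_data` and the STRING type of file_id are made-up placeholders, not my real schema:

```
from google.cloud import bigquery

# Placeholder table names for illustration only.
MAPPING_TABLE = "my_project.etl.file_id_map"
TARGET_TABLE = "my_project.etl.transformed_data"


def lookup_file_id(**context):
    """Map the incoming file name to its file ID via the mapping table."""
    file_name = context["dag_run"].conf["file_path"].split("/")[-1]
    client = bigquery.Client()
    job = client.query(
        "SELECT file_id FROM `{}` WHERE file_name = @file_name".format(MAPPING_TABLE),
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("file_name", "STRING", file_name)
            ]
        ),
    )
    rows = list(job.result())
    return rows[0].file_id if rows else None


def delete_existing_records(**context):
    """Delete any rows already loaded for this file ID before re-loading."""
    file_id = context["ti"].xcom_pull(task_ids="lookup_file_id")
    if file_id is None:
        return  # Brand-new file, nothing to delete.
    client = bigquery.Client()
    client.query(
        "DELETE FROM `{}` WHERE file_id = @file_id".format(TARGET_TABLE),
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("file_id", "STRING", file_id)
            ]
        ),
    ).result()


lookup_task = PythonOperator(
    task_id="lookup_file_id",
    python_callable=lookup_file_id,
    provide_context=True,
    dag=dag,
)

delete_task = PythonOperator(
    task_id="delete_existing_records",
    python_callable=delete_existing_records,
    provide_context=True,
    dag=dag,
)

lookup_task >> delete_task  # the Dataflow task would then follow
```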
After the Dataflow job, and ideally before moving the file to the success/failure folder, I would like to append a row to some "records" table saying that this game was loaded at this time (a rough sketch of what I mean is below). This will be my way to see all the inserts that have occurred. I have tried to look for different ways to do this, but I am new to Cloud Composer and after 10+ hours of research I still do not have a clear idea of how it would work, or else I would post code for input.
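One hedged option I was considering for that "records" table is another PythonOperator that streams a single audit row after the Dataflow task succeeds and before the move-to-bucket task; again, `etl.load_history` and its columns are made-up names for the sketch, and it reuses the `dag` and lookup task from above:

```
from datetime import datetime
from google.cloud import bigquery

AUDIT_TABLE = "my_project.etl.load_history"  # placeholder name


def record_load(**context):
    """Append one audit row noting which file ID was loaded and when."""
    file_id = context["ti"].xcom_pull(task_ids="lookup_file_id")
    client = bigquery.Client()
    errors = client.insert_rows_json(
        AUDIT_TABLE,
        [{"file_id": file_id, "loaded_at": datetime.utcnow().isoformat()}],
    )
    if errors:
        raise RuntimeError("Audit insert failed: {}".format(errors))


record_task = PythonOperator(
    task_id="record_load",
    python_callable=record_load,
    provide_context=True,
    dag=dag,
)

# dataflow_task >> record_task >> move_file_task
```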
Thanks, I am very grateful for everyone's help and apologize if this is not as clear as you would like. The documentation on Airflow is very robust, but given that Cloud Composer and its BigQuery integration are relatively new, it is difficult to learn as thoroughly how to do some of the GCP-specific tasks.