I am looking for guidance on how to manage incremental loads into BigQuery. Here is our process:
- We receive CSV files in GCS. As soon as a file arrives, we load it into the corresponding table in the staging area, along with an ingestion timestamp.
- From the staging area we join the tables together and push the result to a denormalized table in the Reporting layer (we use SQL for ETL; an illustrative version is sketched right after this list).
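
For context, the reporting load is essentially the following (table and column names here are illustrative, not our real schema):

```sql
-- Illustrative version of our staging-to-reporting SQL.
-- header/detail/ref are the staging tables; real names/columns differ.
INSERT INTO `myproject.reporting.denormalized`
SELECT
  h.file_date,
  h.order_id,
  d.line_item,
  d.amount,
  r.ref_description,
  h.ingestion_ts
FROM `myproject.staging.header` AS h
JOIN `myproject.staging.detail` AS d
  ON h.order_id = d.order_id AND h.file_date = d.file_date
LEFT JOIN `myproject.staging.ref` AS r
  ON d.ref_code = r.ref_code;
```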
My question is: how do I make sure my SQL picks up only the unprocessed data from the staging area?
One thought I have is to create an etl_control_table that tracks which header/detail/ref data has already been processed. However, I am looking for suggestions in case there is some cleverer way to accomplish the same thing (based on the ingestion timestamp, etc.).
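
Roughly what I have in mind for the control table (schema and names are just a sketch):

```sql
-- Sketch of the etl_control_table idea; schema and names are illustrative.
CREATE TABLE IF NOT EXISTS `myproject.etl.etl_control_table` (
  table_name   STRING,    -- 'header', 'detail', or 'ref'
  file_date    DATE,      -- staging partition that has been processed
  processed_at TIMESTAMP  -- when the reporting load consumed it
);

-- The reporting SQL would then restrict each staging table to
-- partitions not yet marked as processed, e.g.:
SELECT h.*
FROM `myproject.staging.header` AS h
WHERE h.file_date NOT IN (
  SELECT file_date
  FROM `myproject.etl.etl_control_table`
  WHERE table_name = 'header'
);
```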
Note: all my tables are partitioned on file_date.
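
The kind of "clever" alternative I am wondering about: since the reporting table carries the same file_date, could a simple high-water mark replace the control table? Something like (again, names are illustrative):

```sql
-- Possible watermark approach instead of a control table (illustrative).
-- Use the latest file_date already in reporting as the high-water mark;
-- because staging is partitioned on file_date, the filter below should
-- also let BigQuery prune staging partitions.
DECLARE last_loaded DATE DEFAULT (
  SELECT IFNULL(MAX(file_date), DATE '1900-01-01')
  FROM `myproject.reporting.denormalized`
);

INSERT INTO `myproject.reporting.denormalized`
SELECT
  h.file_date,
  h.order_id,
  d.line_item,
  d.amount,
  r.ref_description,
  h.ingestion_ts
FROM `myproject.staging.header` AS h
JOIN `myproject.staging.detail` AS d
  ON h.order_id = d.order_id AND h.file_date = d.file_date
LEFT JOIN `myproject.staging.ref` AS r
  ON d.ref_code = r.ref_code
WHERE h.file_date > last_loaded;
```

The obvious risk I see with a pure watermark is a late-arriving file for a file_date that has already been loaded, which is partly why I am considering the control table. Is there a standard pattern for this?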