I'm trying to use Google's Cloud Data Fusion (CDF) to perform an ETL of some OLTP data in PostGres into BigQuery (BQ). We will copy the contents of a few select tables into an equivalent table in BQ every night - we will add one column with the datestamp.
So imagine we have a table with two columns A & B, and one row of data like this in PostGres
|--------------------|
| A | B |
|--------------------|
| egg | milk |
|--------------------|
Then over two days, the BigQuery table would look like this
|-------------------------------|
| ds | A | B |
|-------------------------------|
| 22-01-01 | egg | milk |
|-------------------------------|
| 22-01-02 | egg | milk |
|-------------------------------|
However, I'm worried that the way I am doing this in CDF is not idempotent and if the pipeline runs twice I'll have duplicate data for a given day in BQ (not desired)
One idea is to delete rows for that day in BQ before doing the ETL (as part of the same pipeline). However, not sure how to do this, or if it is best practice. Any ideas?