I am planning to append incremental data on a daily basis to a BigQuery table. Each time I add incremental data to the existing table, I want to eliminate duplicate records (based on a primary key column) from the existing data in the table. One approach would be to -
- Collect the set of keys from the incremental data (lets call it
INCR_KEYS
) - Run a query on the lines of -
SELECT all_cols from table where pkey_col NOT IN (INCR_KEYS)
- and store the results in a new table. - Append the incremental data to the new table.
My concern with this approach is that it creates a duplicate copy of a big table and adds to my bills.
Is there a better way of achieving the same without creating a duplicate table?