Our data warehouse team is evaluating BigQuery as a columnar data warehouse solution and has some questions about its features and best use. Our existing ETL pipeline consumes events asynchronously through a queue and persists them idempotently into our current database technology. This idempotent architecture lets us occasionally replay several hours or days of events to correct for errors and data outages, with no risk of duplication.
In testing BigQuery, we've experimented with the real-time streaming insert API, using a unique key as the insertId. This gives us de-duplication over a short window, but re-streaming the data at a later time results in duplicates. As a result, we need an elegant option for removing duplicates in or near real time to avoid data discrepancies.
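For reference, this is roughly what our streaming path does today, simplified for illustration. The project, dataset, table, and field names are hypothetical, and this assumes the `google-cloud-bigquery` Python client:

```python
from google.cloud import bigquery

# Hypothetical project/dataset/table names for illustration only.
client = bigquery.Client(project="our-project")
table_id = "our-project.analytics.events"

def stream_event(event: dict) -> None:
    """Stream a single event, passing our unique event key as the
    insertId so BigQuery can best-effort de-duplicate re-sends that
    arrive within its short de-duplication window."""
    errors = client.insert_rows_json(
        table_id,
        [event],
        row_ids=[event["event_id"]],  # insertId == our idempotency key
    )
    if errors:
        raise RuntimeError(f"Streaming insert failed: {errors}")
```

Re-streaming the same `event_id` hours or days later falls outside that window, which is where the duplicates come from.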
We have a couple of questions and would appreciate answers to any of them. Any additional advice on using BigQuery in an ETL architecture is also appreciated.
- Is there a common implementation for de-duplicating real-time streaming data beyond the use of the insertId?
- If we attempt a "delsert" (a delete followed by an insert via the BigQuery API), will the delete always precede the insert, or do the operations arrive asynchronously?
- Is it possible to implement real-time streaming into a staging environment, followed by a scheduled merge into the destination table? This is a common solution for other column store ETL technologies, but we have seen no documentation suggesting its use in BigQuery. (A sketch of the pattern we have in mind follows this list.)
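To clarify the last question, this is the pattern we are imagining, not something we've confirmed as a supported BigQuery practice. It assumes BigQuery's MERGE DML, that the staging and destination tables share a schema, and hypothetical table names:

```python
from google.cloud import bigquery

client = bigquery.Client(project="our-project")

# Fold newly streamed rows from staging into the destination table,
# skipping any event_id that already exists there.
MERGE_SQL = """
MERGE `our-project.analytics.events` AS dest
USING `our-project.analytics.events_staging` AS stage
ON dest.event_id = stage.event_id
WHEN NOT MATCHED THEN
  INSERT ROW
"""

def merge_staging_into_destination() -> None:
    """Intended to run on a schedule (e.g., hourly): the streaming
    path stays append-only into staging, and this merge enforces
    uniqueness in the destination table."""
    client.query(MERGE_SQL).result()  # blocks until the merge completes
```

After a successful merge, the staging table could be truncated or its merged rows deleted, so replays would only ever re-populate staging rather than duplicate the destination.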