Our data warehouse team is evaluating BigQuery as a columnar data warehouse solution and has some questions about its features and best use. Our existing ETL pipeline consumes events asynchronously through a queue and persists them idempotently into our current database technology. This idempotent architecture lets us occasionally replay several hours or days of events to correct for errors and data outages, with no risk of duplication.
In testing BigQuery, we've experimented with the real-time streaming insert API, using a unique key as the insertId. This gives us de-duplication over a short window, but re-streaming the data at a later time results in duplicates. As a result, we need an elegant option for removing duplicates in or near real time to avoid data discrepancies.
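For reference, this is roughly what our streaming path does today, simplified for illustration. The project, dataset, table, and field names are hypothetical, and this assumes the `google-cloud-bigquery` Python client:

```python
from google.cloud import bigquery

# Hypothetical project/dataset/table names for illustration only.
client = bigquery.Client(project="our-project")
table_id = "our-project.analytics.events"

def stream_event(event: dict) -> None:
    """Stream a single event, passing our unique event key as the
    insertId so BigQuery can best-effort de-duplicate re-sends that
    arrive within its short de-duplication window."""
    errors = client.insert_rows_json(
        table_id,
        [event],
        row_ids=[event["event_id"]],  # insertId == our idempotency key
    )
    if errors:
        raise RuntimeError(f"Streaming insert failed: {errors}")
```

Re-streaming the same `event_id` hours or days later falls outside that window, which is where the duplicates come from.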
We have a couple of questions and would appreciate answers to any of them. Any additional advice on using BigQuery in an ETL architecture is also appreciated.
- Is there a common implementation for de-duplicating real-time streaming data beyond the use of the insertId?
- If we attempt a "delsert" (a delete followed by an insert via the BigQuery API), will the delete always precede the insert, or do the operations arrive asynchronously?
- Is it possible to implement real-time streaming into a staging environment, followed by a scheduled merge into the destination table? This is a common solution for other column store ETL technologies, but we have seen no documentation suggesting its use in BigQuery. (A sketch of the pattern we have in mind follows this list.)
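To clarify the last question, this is the pattern we are imagining, not something we've confirmed as a supported BigQuery practice. It assumes BigQuery's MERGE DML, that the staging and destination tables share a schema, and hypothetical table names:

```python
from google.cloud import bigquery

client = bigquery.Client(project="our-project")

# Fold newly streamed rows from staging into the destination table,
# skipping any event_id that already exists there.
MERGE_SQL = """
MERGE `our-project.analytics.events` AS dest
USING `our-project.analytics.events_staging` AS stage
ON dest.event_id = stage.event_id
WHEN NOT MATCHED THEN
  INSERT ROW
"""

def merge_staging_into_destination() -> None:
    """Intended to run on a schedule (e.g., hourly): the streaming
    path stays append-only into staging, and this merge enforces
    uniqueness in the destination table."""
    client.query(MERGE_SQL).result()  # blocks until the merge completes
```

After a successful merge, the staging table could be truncated or its merged rows deleted, so replays would only ever re-populate staging rather than duplicate the destination.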