How to perform ETL on BigQuery avoiding duplications?

Question

I have a product table of: id createdOn UpdatedOn and 76 more columns.

createdOn UpdatedOn are TIMESTAMP. createdOn is the partition field.

Each ETL loads records from storage to product table (append).

Once ETL finishes I have duplication in the product table.

Example:

id  createdOn,                    updatedOn,                stock, status
1   2018-09-14 14:14:24.305676   2018-09-14 14:14:24.305676  10    5
3   2018-09-14 14:14:24.305676   2018-09-14 14:14:24.305676  10     5
1   2018-09-14 14:14:24.305676   2018-09-14 14:14:24.305676  10     5
3   2018-09-14 14:14:24.305676   2018-09-15 10:00:00.000000  7     5

I want to remove the duplicated ids. The record with the most recent updatedOn needs to stay, rest needs to be removed.

I acted according to the suggestion here: Google BQ - how to upsert existing data in tables? This is my query:

   DELETE FROM `storage.prodcuts` AS d
   WHERE (SELECT ROW_NUMBER() OVER (PARTITION BY createdOn, id ORDER BY updatedOn DESC)
           FROM `storage.prodcuts` AS d2
           WHERE d.id = d2.id ) > 1;

This doesn't work:

Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.

Also tried:

 delete FROM `storage.prodcuts` as p
  ( SELECT ROW_NUMBER() OVER (PARTITION BY createdOn, id  order by updatedOn DESC) as rn , id FROM `storage.prodcuts` ) as t
WHERE t.rn> 1 and p.id=t.id;

Gives:

Syntax error: Unexpected "(" at [3:7]

I assume that BigQuery wants me to join between the delete table and the row_number table (though its the same one)? how can I fix this?

Have you tried using the MERGE statement? It's part of the same DML documentation: https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement — Elliott Brossard, Sep 17 '18 at 06:04
@ElliottBrossard I'm not sure this is doable. I have 76 columns. Each one of them can be updated. This is a nightmare to specify everything manually. — Programmer120, Sep 17 '18 at 06:30

Pentium10 · Answer 1 · 2018-09-17T07:01:33.213

0

Maintain two table, one that you have right now: everything in, have another table which is the cleanup version, you can MERGE from the 1st to 2nd table based on the id column only. This way the latest row will be in the second table, and with the MERGE statement it will be overwritten all the time.

You can now even Schedule Queries so your MERGE statements can be executed automatically every X times.

edited Sep 17 '18 at 07:01

answered Sep 17 '18 at 06:47

Pentium10

204,586
122
423
502

This instructions are written everywhere yet none of the places show actual example of how to do it with a table with dozes of columns. I am more than happy to adopt this method but from what I see such query will be a nightmare. say I have production_products and recent_products (which is the ETL table) . How do you suggest to merge recent_products (ETL) into production_products when having 76 columns. Should I specify 76 columns in the Insert and Update statement? :/ – Programmer120 Sep 17 '18 at 06:51
Please note that in SQL only SELECT * is something where you omit naming the columns, in all others INSERT/UPDATE/MERGE you need to name all the columns. See the answer here that applies to you as well: https://stackoverflow.com/questions/51194105/cannot-query-over-table-without-a-filter-that-can-be-used-for-partition-eliminat – Pentium10 Sep 17 '18 at 06:59
I'm using Apache Airflow for scheduling tasks. I need only statements where I can run as SQL. Do you see a way where I can make my delete work? – Programmer120 Sep 17 '18 at 07:05
Write the select to a temp table. Use that in your DELETE. Self-referencing is not possible. – Pentium10 Sep 17 '18 at 07:15
Can you explain why something like" MERGE dataset.products T USING dataset.products_etl S ON T.id= S.id" cant be done ? Why I must specify the Insert and the Update? I want all columns to be update and if id doesn't exists then insert the whole row. – Programmer120 Sep 17 '18 at 07:17
The SQL standard doesn't define MERGE like that. – Pentium10 Sep 17 '18 at 07:57
and what about legacy SQL? – Programmer120 Sep 17 '18 at 08:00

score 0 · Answer 2 · answered Sep 17 '18 at 07:06

0

If you don't have rows with duplicate (id, updatedOn) pairs, or can resolve such duplicates in other ways, you can do something simple like below to achieve this:

DELETE FROM `storage.prodcuts` AS d
WHERE EXISTS (SELECT 1 FROM `storage.prodcuts` p2 
              WHERE p2.id = d.id AND p2.updatedOn > d.updatedOn)

But from maintainability perspective, the Pentium10's solution of using two tables I think is way superior.

answered Sep 17 '18 at 07:06

Michael Entin

7,189
3
21
26

I can have duplicates.. note my example (id=1). Pentinum10 answer while interesting still force me to specify 76 columns in the update and in the insert. it's one thing mention them for insert but update it too much. I must do : update set id=id,createdOn=createdOn..... that will take hours! – Programmer120 Sep 17 '18 at 07:14
I honestly don't understand why there isn't "MERGE dataset.products T USING dataset.products_etl S ON T.id= S.id" Why I must specify everything? Simply insert id that doesn't exist and what exists update the whole row – Programmer120 Sep 17 '18 at 07:16

How to perform ETL on BigQuery avoiding duplications?

2 Answers2