I have a product table of:
id createdOn UpdatedOn
and 76 more columns.
createdOn UpdatedOn
are TIMESTAMP
.
createdOn
is the partition field.
Each ETL loads records from storage to product table (append).
Once ETL finishes I have duplication in the product table.
Example:
id createdOn, updatedOn, stock, status
1 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 10 5
3 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 10 5
1 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 10 5
3 2018-09-14 14:14:24.305676 2018-09-15 10:00:00.000000 7 5
I want to remove the duplicated id
s. The record with the most recent updatedOn
needs to stay, rest needs to be removed.
I acted according to the suggestion here: Google BQ - how to upsert existing data in tables? This is my query:
DELETE FROM `storage.prodcuts` AS d
WHERE (SELECT ROW_NUMBER() OVER (PARTITION BY createdOn, id ORDER BY updatedOn DESC)
FROM `storage.prodcuts` AS d2
WHERE d.id = d2.id ) > 1;
This doesn't work:
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
Also tried:
delete FROM `storage.prodcuts` as p
( SELECT ROW_NUMBER() OVER (PARTITION BY createdOn, id order by updatedOn DESC) as rn , id FROM `storage.prodcuts` ) as t
WHERE t.rn> 1 and p.id=t.id;
Gives:
Syntax error: Unexpected "(" at [3:7]
I assume that BigQuery wants me to join between the delete table
and the row_number
table (though its the same one)? how can I fix this?