There are plenty of great posts on SQL that selects the unique rows and then writes (truncates) the table so the duplicates are removed, e.g.:

WITH ev AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) AS rowNum
  FROM `duplicates`
)
SELECT
  * EXCEPT(rowNum)
FROM
  ev
WHERE rowNum = 1
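
For reference, the write/truncate step can be done in plain SQL by wrapping the query above in CREATE OR REPLACE TABLE (a sketch, reusing the `duplicates` table name from the example):

-- Sketch: rewrite the table keeping only the latest row per id
CREATE OR REPLACE TABLE `duplicates` AS
WITH ev AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) AS rowNum
  FROM `duplicates`
)
SELECT * EXCEPT(rowNum)
FROM ev
WHERE rowNum = 1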

I was trying to explore this slightly differently using DML and DELETE (e.g. if you don't want to use a BQ saved query, just execute SQL). What I want to do is roughly this:

WITH dup_events AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) AS rowNum
  FROM `duplicates`
)
DELETE FROM
  dup_events
WHERE rowNum > 1

but got this error in the console:

Syntax error: Expected "(" or keyword SELECT but got keyword DELETE at [10:1]

Can this be achieved (standardSQL) using DELETE?

– Kurt Maile

4 Answers


Below actually works :o)

#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
        SELECT AS STRUCT id, MAX(loadTime) loadTime 
        FROM `yourproject.yourdataset.duplicates` 
        GROUP BY id)  

Note: this assumes that loadTime is also unique per id - meaning if, for a given id, there is more than one record with the latest loadTime, they will all be preserved.
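
If you need exactly one row per id even when the latest loadTime is tied, DELETE alone can't break the tie (there is no row identifier to compare on), so you'd fall back to rewriting the table. A sketch, not part of the answer above, assuming QUALIFY is available:

#standardSQL
-- Sketch: keep exactly one row per id, breaking loadTime ties arbitrarily
CREATE OR REPLACE TABLE `yourproject.yourdataset.duplicates` AS
SELECT *
FROM `yourproject.yourdataset.duplicates`
WHERE TRUE  -- QUALIFY requires a WHERE, GROUP BY, or HAVING clause
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) = 1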

– Mikhail Berlyant

From the syntax documentation, the argument to DELETE needs to be a table, and there is no provision for using a WITH clause. This makes sense, given that you can't delete from what is essentially a logical view (a CTE). You can express what you want by putting the logic inside the filter, e.g.:

DELETE
FROM duplicates AS d
WHERE (SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC)
       FROM `duplicates` AS d2
       WHERE d.id = d2.id AND d.loadTime = d2.loadTime) > 1;
– Elliott Brossard
  • Somehow I thought analytic functions were not allowed in a WHERE clause. I am on the go now and cannot check myself - can you confirm it actually works? – Mikhail Berlyant Jan 06 '18 at 19:34
  • In terms of tradeoffs, what is your view on using this method versus select-and-override? DML quota is of course one, and soon this will be supported on partitioned tables (I'm on the beta). One thing that seemed a plus is the risk of data loss - what I wasn't quite sure of with the other approach is: if the table being deduped is loaded by another process at the same time you execute a dedup query (a large table can take minutes to execute such a query), is there a risk of missing the newly loaded data if the query and load times are misaligned? Thoughts on this? Cheers! – Kurt Maile Jan 06 '18 at 19:37
  • actually, I just tried this and got this error - Error: Analytic function not allowed in WHERE clause at [5:4] – Kurt Maile Jan 06 '18 at 19:39
  • I think it can still work, but I need to try running an actual query :) Let me see... – Elliott Brossard Jan 06 '18 at 19:45
  • Sorry, I don't have a good environment to test this out. No laptop and it's difficult from my phone :( Can you see if this edit works? It's not as efficient as the analytic function approach would have been, but it might work at least... – Elliott Brossard Jan 06 '18 at 20:25
  • Thanks Elliott - I got this error this time: Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN. – Kurt Maile Jan 06 '18 at 21:10
  • Hi Elliott, did you manage to tweak your query to avoid the error above and get it to work? – Kurt Maile Jan 07 '18 at 08:38
  • this doesn't work; it fails with the error `Correlated subqueries that reference other tables are not supported unless they can be de-correlated` – Reza Apr 05 '23 at 00:38

This has to be the easiest way:

create or replace table `myproject.mydataset.duplicates` as (
select distinct *
from `myproject.mydataset.duplicates`)
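
A quick sanity check before and after (a sketch; TO_JSON_STRING serializes whole rows, which matches what select distinct * deduplicates on):

select count(*) as total_rows,
       count(distinct to_json_string(t)) as distinct_rows
from `myproject.mydataset.duplicates` t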

If you have an array data type, try this:

-- build a test table with a duplicate and an array datatype column --
create or replace table DW.pmoTest as (
select 1 as ID, 'peter' as firstname, ARRAY<INT64>[1, 2, 3] as int_array, current_date as createdate
union all
select 1 as ID, 'peter' as firstname, ARRAY<INT64>[1, 7, 3] as int_array, current_date as createdate
union all
select 2 as ID, 'chamri' as firstname, ARRAY<INT64>[1, 2, 39, 4] as int_array, current_date as createdate
);

-- recreate table without duplicate row
create or replace table DW.pmoTest as (
SELECT col.* FROM (
  SELECT ARRAY_AGG(tbl ORDER BY createdate LIMIT 1)[OFFSET(0)]  col
  FROM DW.pmoTest tbl
  GROUP BY ID
  )
);
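
To verify the rewrite, a sketch that should return zero rows once each ID is unique:

select ID, count(*) as n
from DW.pmoTest
group by ID
having count(*) > 1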
– pmo511
  • I get this error when trying the part in parentheses: Column instance of type ARRAY cannot be used in SELECT DISTINCT – makmak Sep 02 '20 at 00:35

The answers above work only for small tables. If you have a large partitioned table and only want to remove duplicates within a given range, use the SQL below:

-- WARNING: back up the table before this operation
-- For a large timestamp-partitioned table
-- -------------------------------------------
-- -- To de-duplicate rows in a given range of a partitioned table, using surrogate_key as the unique id
-- -------------------------------------------
    
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles");
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
    
MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
  SELECT k.*
  FROM (
    SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k 
    FROM `gcp_project`.`data_set`.`the_table` AS original_data
    WHERE stamp BETWEEN dt_start AND dt_end
    GROUP BY surrogate_key
  )
  
) AS INTERNAL_SOURCE
ON FALSE

WHEN NOT MATCHED BY SOURCE
  AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partition range
THEN DELETE
  
WHEN NOT MATCHED THEN INSERT ROW

credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
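
As a follow-up check (a sketch, not part of the gist), run this in the same script so the declared variables are still in scope; it should return zero rows once the range is de-duplicated:

SELECT surrogate_key, COUNT(*) AS cnt
FROM `gcp_project`.`data_set`.`the_table`
WHERE stamp BETWEEN dt_start AND dt_end
GROUP BY surrogate_key
HAVING COUNT(*) > 1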

– Hui Zheng