
I am trying to use DELETE to remove duplicate records from my BigQuery table.

I found multiple solutions to this problem, but most of them use CREATE, REPLACE or SELECT.

The closest solutions using DELETE that I found were:

BigQuery - DELETE statement to remove duplicates

BigQuery Standard SQL: Delete Duplicates from Table

I have a follow-up question on the solution below:

#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
  SELECT AS STRUCT id, MAX(loadTime) loadTime
  FROM `yourproject.yourdataset.duplicates`
  WHERE id = '123'
  GROUP BY id)

This statement deletes all the records from the table that do not satisfy the NOT IN condition. For example, if my table looks like this:

Id      Loadtime
123     5
123     4
456     2
321     1

The query above deletes all records except the first row. How can I modify the query so that it deletes only the 2nd row, i.e. so that it only removes the older duplicates within each id group?

The final output should be:

Id      Loadtime
123     5
456     2
321     1

1 Answer


The query below should work as per your expectation:

#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
  SELECT AS STRUCT id, MAX(loadTime) loadTime
  FROM `yourproject.yourdataset.duplicates`
  GROUP BY id)

So, in your sample, it will delete ONLY the second row:

Id  Loadtime     
123 4    
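
To see why, here is a sketch (same placeholder table name) of what the unfiltered subquery returns for the sample data: one "latest" struct per id, so only an older duplicate within a group can fail the NOT IN test.

#standardSQL
-- Sketch only: without the id filter, the subquery keeps MAX(loadTime) per id
SELECT AS STRUCT id, MAX(loadTime) loadTime
FROM `yourproject.yourdataset.duplicates`
GROUP BY id
-- Returns (123, 5), (456, 2), (321, 1).
-- Only the row (123, 4) is missing from this set, so only it is deleted.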
  • It is only keeping the 1 record which satisfies the subquery result and deleting the rest. – TigSh Feb 18 '20 at 21:07
  • the rest in same `id` group - right? isn't this what you wanted? – Mikhail Berlyant Feb 18 '20 at 21:09
  • It was deleting rows with Id 456 and 321. I just removed the condition where id= '123' from my query and it only deleted the 2nd row and kept 1st, 3rd and 4th row. Not sure why the where clause would cause this but it works now! Thanks for your help – TigSh Feb 18 '20 at 21:07
  • yes, that's exactly what my answer points to. glad you made it work :o) – Mikhail Berlyant Feb 18 '20 at 22:52
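
As noted in the comments, removing the where id = '123' filter is what makes the DELETE scoped per id group. A quick way to verify the result afterwards (a sketch, same placeholder table name) is to check that no id still has more than one row:

#standardSQL
-- Sketch only: confirm no duplicate ids remain after the DELETE
SELECT id, COUNT(*) AS cnt
FROM `yourproject.yourdataset.duplicates`
GROUP BY id
HAVING COUNT(*) > 1
-- Expect zero rows if the deduplication worked.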