How to delete duplicates from a Delta table which doesn't have a primary key

Asked Oct 10 '22 at 03:59

Active Oct 10 '22 at 06:27

Viewed 824 times

I want to delete identical rows from a Delta table which doesn't have a primary key.

If I have a Delta table like the one below:

I need to remove the duplicates by comparing entire rows, and I need a result like:

How can I achieve this?

edited Oct 10 '22 at 06:27

Alex Ott

asked Oct 10 '22 at 03:59

Harshith K R

I feel like the only way you could do this is to a) put all of the DISTINCT duplicate records into a temporary/working table, then b) loop through the temporary table records and delete all of the corresponding duplicates from the source table (i.e. source records that match the temporary table values), then c) insert all of the DISTINCT records (from your temporary table) back into the main table. It's a bit yucky, but it would achieve the desired result. And, of course, if you have the ability to restructure your main table with a primary key, then that will be useful in the future. – Craig Oct 10 '22 at 04:37
Surely there's a better way? E.g. for MySQL, google "DELETE JOIN". – BenKoshy Oct 10 '22 at 05:32
`spark.read.format("delta").load(path).dropDuplicates().write.format("delta").mode("overwrite").save(path)` ? – Alex Ott Oct 10 '22 at 06:28
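
For readability, the one-liner in the comment above can be unpacked into a small helper. This is only a sketch: the function name `dedupe_delta_table` is made up for illustration, and it assumes `spark` is an existing SparkSession with Delta Lake configured and `path` is the table's storage location. Calling `dropDuplicates()` with no column list compares entire rows, which is what the question asks for.

```python
def dedupe_delta_table(spark, path):
    """Rewrite the Delta table at `path` with exact duplicate rows removed.

    `spark` is assumed to be a SparkSession with Delta Lake support;
    `path` is assumed to point at an existing Delta table.
    """
    # Read the current snapshot of the table.
    df = spark.read.format("delta").load(path)

    # With no column list, dropDuplicates() treats two rows as duplicates
    # only when every column matches, so no primary key is needed.
    deduped = df.dropDuplicates()

    # Overwrite the table in place; Delta records this as a new version.
    deduped.write.format("delta").mode("overwrite").save(path)


# Hypothetical usage (the path is a placeholder):
# dedupe_delta_table(spark, "/path/to/delta/table")
```

Note that this rewrites the whole table, so it may be expensive on large tables. The previous version should still be reachable through Delta time travel until the old files are vacuumed.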

0 Answers