
I want to delete identical rows from a Delta table that doesn't have a primary key.

How can I do this?

For example, if I have a Delta table like the one below:

enter image description here

I need to remove the duplicates by comparing entire rows, so that the result looks like this:

enter image description here

How can this be achieved?

Alex Ott
  • I feel like the only way you could do this is to a) put all of the DISTINCT duplicate records into a temporary/working table, then b) loop through the temporary table records and delete all of the corresponding duplicates from the source table (i.e. source records that match the temporary table values), then c) insert all of the DISTINCT records (from your temporary table) into the main table. It's a bit yucky, but would achieve the desired result. And, of course, if you have the ability to restructure your main table with a primary key, then that will be useful in the future. – Craig Oct 10 '22 at 04:37
  • Surely there's a better way? E.g. in MySQL, search for "DELETE JOIN". – BenKoshy Oct 10 '22 at 05:32
  • `spark.read.format("delta").load(path).dropDuplicates().write.format("delta").mode("overwrite").save(path)` ? – Alex Ott Oct 10 '22 at 06:28
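
The one-liner in the last comment can be unpacked into a small helper. This is only a minimal sketch: the function name `drop_duplicate_rows` is made up for illustration, and it assumes `spark` is an existing SparkSession with Delta Lake configured and `path` is the location of the table.

```python
def drop_duplicate_rows(spark, path):
    """Rewrite the Delta table at `path`, keeping one copy of each distinct row.

    Assumptions (not from the question): `spark` is a SparkSession with
    Delta Lake available, and `path` points at an existing Delta table.
    """
    # Read the current snapshot of the table.
    df = spark.read.format("delta").load(path)

    # Called without a column list, dropDuplicates() compares entire rows,
    # so no primary key is needed.
    deduped = df.dropDuplicates()

    # Overwrite the same table with the de-duplicated rows.
    deduped.write.format("delta").mode("overwrite").save(path)
    return deduped
```

Calling `drop_duplicate_rows(spark, path)` rewrites the whole table, so this is probably best suited to tables that are cheap enough to rewrite in full. The overwrite is recorded as a new table version, so the earlier version should remain reachable through Delta time travel until old files are vacuumed.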

0 Answers