
I want to update my target Delta table in Databricks when certain column values in a row match the same column values in the source table.

The problem arises when multiple rows in the source table match a single row in the target Delta table.

This is a scenario where the primary keys of two or more rows in the source table match the primary key of one row in the Delta table. I have tried to replicate the scenario below (the MERGE fails because two source rows match the same target row):

    targetTable = "test.targetDF"
    tempView = "tempView"

    sql = """
    MERGE INTO """ + targetTable + """ TGT USING """ + tempView + """ SRC
    ON TGT.id = SRC.id AND TGT.category != SRC.category AND TGT.startDt = SRC.startDt
    WHEN MATCHED THEN DELETE"""

    targetDF = spark.createDataFrame([{"id": "5001", "category": "N1", "startDt": "2019-09-30 00:00:00.000"}])
    sourceDF = spark.createDataFrame([
        {"id": "5001", "category": "E1", "startDt": "2019-09-30 00:00:00.000"},
        {"id": "5001", "category": "B1", "startDt": "2019-09-30 00:00:00.000"}])
    targetDF.write.format("delta").mode("overwrite").saveAsTable(targetTable)
    sourceDF.createOrReplaceTempView(tempView)

    sqlOut = spark.sql(sql)
    display(spark.sql("select * from test.targetDF"))

I have tried a left join (targetTable left join sourceTable) on the two tables, matching on id and startDt, to get the one row in my target table that I would like to delete, but I couldn't figure out how to perform the delete.

    spark.sql("""SELECT TGT.id FROM test.targetDF TGT LEFT JOIN tempView SRC ON TGT.id = SRC.id AND TGT.startDt = SRC.startDt""")

Thanks in advance.

Saikat

2 Answers

    package spark

    import org.apache.spark.sql.SparkSession

    object ap1 extends App {
      val spark = SparkSession.builder()
        .master("local")
        .appName("DataFrame-example")
        .getOrCreate()

      import spark.implicits._

      case class D(id: String, category: String, startDt: String)

      val targetDF = Seq(D("5001", "N1", "2019-09-30 00:00:00.000"))
        .toDF()
      val sourceDF = Seq(
        D("5001", "E1", "2019-09-30 00:00:00.000"),
        D("5001", "B1", "2019-09-30 00:00:00.000"))
        .toDF()

      // left_semi keeps only the target rows that have at least one match in source,
      // and never duplicates a target row no matter how many source rows match
      val res = targetDF.join(sourceDF, targetDF.col("id") === sourceDF.col("id") &&
        targetDF.col("startDt") === sourceDF.col("startDt"), "left_semi")

      res.show(false)
      //  +----+--------+-----------------------+
      //  |id  |category|startDt                |
      //  +----+--------+-----------------------+
      //  |5001|N1      |2019-09-30 00:00:00.000|
      //  +----+--------+-----------------------+
    }
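
For reference, the same semi join can also be expressed directly in Spark SQL (a sketch, assuming the target has been saved as `test.targetDF` and the source registered as `tempView`, as in the question):

    SELECT TGT.*
    FROM test.targetDF TGT
    LEFT SEMI JOIN tempView SRC
      ON TGT.id = SRC.id AND TGT.startDt = SRC.startDt

Because a left semi join only filters the left side, the result contains each matching target row exactly once, which is what makes it safe to feed into a subsequent MERGE or DELETE.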
mvasyliv
  • Thanks a ton. left_semi join works perfectly. I will add spark.sql and pyspark version of it with Delete operation on target table – Saikat Jun 08 '20 at 19:23

I have executed the answer provided by @mvasyliv in spark.sql and added a delete operation that removes a row from the target table whenever it matches multiple rows in the source table.

spark.sql version (note the EXISTS subquery must be correlated with the outer table alias; an uncorrelated EXISTS would delete every row as soon as any match exists):

    spark.sql("""DELETE FROM MDM.targetDF a
                 WHERE EXISTS (SELECT 1 FROM tempView SRC
                               WHERE a.id = SRC.id AND a.startDt = SRC.startDt)""").show()

Pyspark Version:

    targetTable = "test.targetDF"
    tempView = "tempView"

    # finalDF already holds exactly the target rows to delete, so the MERGE
    # condition matches on equality of all three columns (the original
    # "category !=" condition from the question would never match here)
    sql = """
    MERGE INTO """ + targetTable + """ TGT USING """ + tempView + """ SRC
    ON TGT.id = SRC.id AND TGT.startDt = SRC.startDt AND TGT.category = SRC.category
    WHEN MATCHED THEN DELETE"""

    # left_semi returns each target row at most once, however many source rows match
    finalDF = targetDF.join(sourceDF, ((targetDF.id == sourceDF.id) & (targetDF.startDt == sourceDF.startDt)), "left_semi")

    finalDF.createOrReplaceTempView(tempView)

    sqlOut = spark.sql(sql)
    display(spark.sql("select * from test.targetDF"))
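
An alternative worth noting (a sketch, not run here): instead of pre-joining, the multiple-match problem can be avoided by deduplicating the join keys in the source before the MERGE, so that at most one source row matches each target row:

    MERGE INTO test.targetDF TGT
    USING (SELECT DISTINCT id, startDt FROM tempView) SRC
    ON TGT.id = SRC.id AND TGT.startDt = SRC.startDt
    WHEN MATCHED THEN DELETE

This works here because only the join keys decide which target rows get deleted; if other source columns mattered to the action, you would need to pick one row per key (e.g. with a window function) rather than SELECT DISTINCT on the keys alone.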
Saikat