
I have a table demo_table_one into which I want to upsert the following values:

data = [
    (11111, 'CA', '2020-01-26'),
    (11111, 'CA', '2020-02-26'),
    (88888, 'CA', '2020-06-10'),
    (88888, 'CA', '2020-05-10'),
    (88888, 'WA', '2020-07-10'),
    (88888, 'WA', '2020-07-15'),
    (55555, 'WA', '2020-05-15'),
    (55555, 'CA', '2020-03-15'),
]

columns = ['attom_id', 'state_code', 'sell_date']
df = spark.createDataFrame(data, columns)

The logic is that for each attom_id & state_code we only want the latest sell_date, so the data in my table should look like:

[
    (11111, 'CA', '2020-02-26'),
    (88888, 'CA', '2020-06-10'),
    (88888, 'WA', '2020-07-15'),
    (55555, 'WA', '2020-05-15'),
    (55555, 'CA', '2020-03-15')
]
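
For clarity, this is the per-key reduction I am after, sketched directly on the source DataFrame with a window function (just to illustrate the rule; the upsert into the table is what I am asking about):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank rows within each (attom_id, state_code) key, newest sell_date first
w = Window.partitionBy("attom_id", "state_code").orderBy(F.col("sell_date").desc())

latest = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))
latest.show()  # should show the five rows listed above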

and I have the following code to do it:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forName(spark, "demo_table_one")

# perform the UPSERT
(deltaTable.alias('original_table')
    .merge(df.alias('update_table'),
           "original_table.state_code = update_table.state_code and original_table.attom_id = update_table.attom_id")
    .whenNotMatchedInsertAll()
    .whenMatchedUpdateAll("original_table.sell_date < update_table.sell_date")
    .execute())

But this inserts all the values into the table.


1 Answer


You need to query the input first to reduce it to one record per (attom_id, state_code) with the max sell_date, before merging.

Your .whenNotMatchedInsertAll() firing for every row reveals that none of the records were matched in the target table, and thus all were inserted.

I find the Databricks docs not so great, to be honest, but this is what I would do (you could also run the aggregation as a separate query first):

data = [
    (11111, 'CA', '2020-01-26'),
    (11111, 'CA', '2020-02-26'),
    (88888, 'CA', '2020-06-10'),
    (88888, 'CA', '2020-05-10'),
    (88888, 'WA', '2020-07-10'),
    (88888, 'WA', '2020-07-15'),
    (55555, 'WA', '2020-05-15'),
    (55555, 'CA', '2020-03-15'),
]

columns = ['attom_id', 'state_code', 'sell_date']
df = spark.createDataFrame(data, columns)
df.createOrReplaceTempView("newer_data")
 
spark.sql(""" MERGE INTO DEMO_TABLE_ONE
              USING ( SELECT attom_id, state_code, MAX(sell_date) AS sell_date
                      FROM newer_data
                      GROUP BY attom_id, state_code ) AS NEWER
              ON DEMO_TABLE_ONE.attom_id = NEWER.attom_id AND DEMO_TABLE_ONE.state_code = NEWER.state_code
              WHEN MATCHED AND DEMO_TABLE_ONE.sell_date < NEWER.sell_date THEN
                UPDATE SET *
              WHEN NOT MATCHED THEN
                INSERT *
          """)
thebluephantom