0

I have a master table

#+-----------+----------+-------------+
#| Name      | Gender   | date        |
#+-----------+----------+-------------+
#|  Tom      |   M      |  2021-02-15 | 
#|  Bob      |   M      |  2021-03-02 |
#|  Kelly    |   F      |  2021-06-01 |

And a daily table , A daily table can have a data with following conditions

1)Totally new records 2)date column updated for existing records

#+-----------+----------+-------------+
#| Name      | Gender   | date        |
#+-----------+----------+-------------+
#|  Tom      |   M      |  2021-03-20 | date updated 
#|  suzen    |   F      |  2021-06-10 | new records 

expected output master table should have all the new records coming in daily plus if any of the name matches with master table then update the new date from daily

#+-----------+----------+-------------+
#| Name      | Gender   | date        |
#+-----------+----------+-------------+
#|  Tom      |   M      |  2021-03-20 | date updated form daily
#|  Bob      |   M      |  2021-03-02 |
#|  Kelly    |   F      |  2021-06-01 |
#|  suzen    |   F      |  2021-06-10 | New record 

For ease, lets take Name as the unique identifier of both tables.

there is a one way to to join both of these table on full outer and get the result

   select  (case when d.name is null or d.name='' then m.name
            when m.name is null or m.name='' then d.name
            else m.name end) as name,
        (case when d.gender is null or d.gender ='' then m.gender 
            when m.gender is null or m.gender ='' then d.gender 
            else m.gender end) as gender,
         (case when d.date is null or d.date ='' then m.date 
            when m.date is null or m.date ='' then d.date
            else d.date end) as date
    Form master m full outer join daily d on 
    on m.name=d.name

What is the way to achieve the expected output in better and more performant way

punam
  • 155
  • 8

1 Answers1

1

To have a new table (dataframe) based on your criteria, your solution is good.
But if you want to update the master table using the daily table, delta supports upsert into a table using merge:

from delta.tables import *

master_table = DeltaTable.forPath(spark, "/path/to/master")
daily_table = DeltaTable.forPath(spark, "/path/to/daily")

master_table.alias("master").merge(
    daily_table.alias("daily"),
    "master.Name = daily.Name") \
  .whenMatchedUpdate(set = { "Gender" : "daily.Gender", "date": "daily.date"} ) \
  .whenNotMatchedInsert(values =
    {
      "Name": "daily.Name",
      "Gender": "daily.Gender",
      "date": "daily.date"
    }
  ) \
  .execute()

or using SQL

MERGE INTO master
USING daily
ON master.Name = daily.Name
WHEN MATCHED THEN
  UPDATE SET master.Gender = daily.Gender, master.date = daily.date
WHEN NOT MATCHED
  THEN INSERT (Name, Gender, date) VALUES (Name, Gender, date)
Hussein Awala
  • 4,285
  • 2
  • 9
  • 23
  • I have two questions instead of DeltaTable.forPath(spark, "/path/to/master") can I used spark.sql('select * from master') and which sql gives me best performance the one which I have used with outer join or the one which you used with merge and match , can you please help me with the reason, I have read the outer/full join is expensive – punam Aug 02 '21 at 07:43
  • from delta.tables import * Looks like it need spark 3.0 version onword , can you help with spark version lower than 3 – punam Aug 03 '21 at 10:59
  • I am not sure if delta.tables is supported in spark 2.x.x, but I'm sure that you can use the second option (by SQL), so instead of loading the data as a delta table, you can load it as a spark dataframe using `spark.read.format("delta").load("...")`, then you can register the tables as temp views and execute the SQL query which merge the tables... if you have a hive/warehouse metastore, don't need to load the dataframes, you can execute the SQL query directly – Hussein Awala Aug 05 '21 at 12:08
  • I am getting same exception if I execute sql "mismatched input 'MERGE' expecting" I found exception post https://stackoverflow.com/questions/62900812/delta-lake-merge-into-statement – punam Aug 05 '21 at 16:31