I need to update delta table on the basis of update delta table rows.
Update table (source_df)
+----------------------------------------------------------
|ID| NAME|ADDRESS|DELETE_FLAG|INSERT_DATE|UPDATE_DATE|
+----------------------------------------------------------
| 1|sravan|delhi |false |02/02/2023 |02/02/2023|
| 3|rohith|jaipur |false |02/02/2023 |02/02/2023|
Delta table (delta_df)
+----------------------------------------------------------
|ID| NAME|ADDRESS|DELETE_FLAG|INSERT_DATE|UPDATE_DATE|
+----------------------------------------------------------
| 1|sravan|delhi |false |25/01/2023 |25/01/2023|
| 2|ojasvi|patna |false |25/01/2023 |25/01/2023|
| 3|rohith|jaipur |false |25/01/2023 |25/01/2023|
I want to update the delta table (both are delta table instances and not dataframes)
Final table would look like this,
Delta table (Updated)
+----------------------------------------------------------
|ID| NAME|ADDRESS|DELETE_FLAG|INSERT_DATE|UPDATE_DATE|
+----------------------------------------------------------
| 1|sravan|delhi |false |25/01/2023 |25/01/2023|
| 2|ojasvi|patna |true |25/01/2023 |02/02/2023|
| 3|rohith|jaipur |false |25/01/2023 |25/01/2023|
Let the keys for both tables be ID and NAME, I tried the following,
delta_df.merge(
source_df,
f"delta_df.DELETE_FLAG= 0 AND (delta_df.ID <> source_df.ID AND delta_df.NAME <> source_df.NAME)"
).whenMatchedUpdate(
set = {
"DELETE_FLAG" : "1",
"UPDATE_DATE" : "source_df.UPDATE_DATE"
}
).execute()
However, the above code is taking infinite time to execute. I am new to PySPark and not sure what is the most efficient way to deal with the problem at hand.
Note: They are not dataframes.