0

I am having a databricks delta table created on data lake storage which holds data as shown below.

db_name table_name  location                    table_format    table_type  load_ts
--------------------------------------------------------------------------------------------------------
abc     table1      dbfs:/mnt/abc/data/table1   delta           EXTERNAL    2022-09-14T18:48:02.859+0000
abc     table2      dbfs:/mnt/abc/data/table2   delta           EXTERNAL    2022-09-14T18:48:02.859+0000
xyz     table1      dbfs:/mnt/xyz/data/table1   delta           EXTERNAL    2022-09-14T18:48:02.859+0000
xyz     table2      dbfs:/mnt/xyz/data/table2   delta           EXTERNAL    2022-09-14T18:48:02.859+0000
xyz     table3      dbfs:/mnt/xyz/data/table3   delta           EXTERNAL    2022-09-14T18:48:02.859+0000
--------------------------------------------------------------------------------------------------------

Currently I am running this script daily to overwrite the complete table in databrcicks. But my requirement is, this records should update only if there is any change occurred on that particular record else leave it. If there is any new records, it has to be added.

For Ex: Below is the input data frame I get whenever I run the script. enter image description here There is a change for one of the record and one new record gets added. only updated record should get updated and new record should be added to the final table.
The output table is expected to be updated as below. enter image description here

I tried delta merge option but it updates all matching record which is not expected.

I have to keep the load_ts untouched if there NO update/insertion of new record happens (This is to track the time when any update is happened to that record). If any record is updated, ONLY it should update in target table.

Is there any way/logic to achieve this ?
Any leads Appreciated!

Alex Ott
  • 80,552
  • 8
  • 87
  • 132
Antony
  • 970
  • 3
  • 20
  • 46

1 Answers1

0

I reproduced the same thing in my environment its worked fine with the merge operation.

First of all, I create a sample Data frame and did a merge operation.

I have taken sour as a target and vv as a source. Please follow below code:

Code:

%sql

MERGE INTO sour AS target
USING vv as source
 on target.location = source.location
 WHEN MATCHED 
THEN UPDATE SET 
 target.db_name = source.db_name,
 target.table_name = source.table_name,
 target.location = source.location,
 target.table_type = source.table_type,
 target.load_ts = source.load_ts
 
WHEN NOT MATCHED then
INSERT (target.db_name,target.table_name,target.location,target.table_format,target.table_type,target.load_ts) VALUES(source.db_name,source.table_name,source.location,source.table_format,source.table_type,source.load_ts)

enter image description here

Output:

enter image description here

B. B. Naga Sai Vamsi
  • 2,386
  • 2
  • 3
  • 11
  • Thanks for your response. As I mentioned in question, merge option will update all the records which satisfies the condition, which is not expected. From the target table, only the new record should be added and ONLY the updated record from source should be updated in Target. Rest of the records should remain as it is including the load_ts. – Antony Sep 19 '22 at 09:30
  • add a subcondition to `when matched` – Alex Ott Sep 20 '22 at 10:10