I am importing external data into a Delta lake through an entity that has approximately 20M rows. Not huge, but not trivial either. The issue is that the key for this data set is a GUID, so the normal Delta MERGE operation is notoriously inefficient, as the key is effectively unstructured (random values with no locality). The merge currently takes 20 minutes.
There is a creation date field in the entity structure, so I have been experimenting with partitioning the entity by year, month and day to try to optimise performance.
Table structure:
.....
source_system_modified_datetime TIMESTAMP,
__modified_timestamp TIMESTAMP,
year INT,
month INT,
day INT
)
USING delta
LOCATION '/mnt/align/delta/entity'
PARTITIONED BY (year, month, day)
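For reference, the partition columns are derived from the creation date when the staging table is built, roughly along these lines (the creation-date column and source table names here are illustrative, not the real ones):

%sql
-- Sketch only: derive the partition columns from the creation date while loading staging
SELECT
  entity_guid,
  source_system_created_datetime,
  YEAR(source_system_created_datetime)  AS year,
  MONTH(source_system_created_datetime) AS month,
  DAY(source_system_created_datetime)   AS day
FROM raw_entity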
Relevant part of the merge statement:
%sql
-- Merge
MERGE INTO crm_entity_aligned AS target
USING stg_entity AS source
ON (
      target.year = source.year
  AND target.month = source.month
  AND target.day = source.day
  AND target.entity_guid = source.entity_guid
)
WHEN MATCHED AND (
.......further conditions
The idea was to use the creation date to narrow the search space before the GUID comparison is evaluated, if that makes sense. So far I have not been able to make any improvement in the merge time.
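To be concrete, what I had in mind is something like the variant below, where the partition predicates are literals computed from the staging data up front, so the optimizer has a chance to prune partitions instead of only comparing column to column. The literal values are placeholders, and I am not certain this pruning actually kicks in, which is partly what I am asking:

%sql
-- Sketch: keep the column-to-column join, but add literal partition filters
MERGE INTO crm_entity_aligned AS target
USING stg_entity AS source
ON  target.year = source.year
AND target.month = source.month
AND target.day = source.day
AND target.year IN (2023, 2024)   -- placeholder literals, would be computed from stg_entity first
AND target.entity_guid = source.entity_guid
WHEN MATCHED THEN UPDATE SET *    -- simplified; my real statement has further conditions
WHEN NOT MATCHED THEN INSERT *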
Has anyone else found a better working approach to this issue? Is there another way to structure the ON/match condition of the merge statement? Is there an alternative partitioning approach, or some other way to get the merge operation back to a reasonable time, say 3 minutes?
Any answers appreciated... Richard