I am importing external data into a Delta lake through an entity that has approximately 20M rows. Not huge, but not trivial either. The issue is that the key for this data set is a GUID, so the normal Delta MERGE operation is notoriously inefficient, as the key is effectively unstructured (random values with no locality). The merge currently takes 20 minutes.
There is a creation date field in the entity structure, so I have been experimenting with partitioning the entity by year, month and day to try to optimise performance.
Table structure:
.....
source_system_modified_datetime TIMESTAMP,
__modified_timestamp TIMESTAMP,
year INT,
month INT,
day INT
)
USING delta
LOCATION '/mnt/align/delta/entity'
PARTITIONED BY (year, month, day)
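For reference, the partition columns are derived from the creation date when the staging table is built, roughly along these lines (the creation-date column and source table names here are illustrative, not the real ones):

%sql
-- Sketch only: derive the partition columns from the creation date while loading staging
SELECT
  entity_guid,
  source_system_created_datetime,
  YEAR(source_system_created_datetime)  AS year,
  MONTH(source_system_created_datetime) AS month,
  DAY(source_system_created_datetime)   AS day
FROM raw_entity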
Relevant part of the merge statement:
%sql
-- Merge
MERGE INTO crm_entity_aligned AS target
USING stg_entity AS source
ON (
      target.year = source.year
  AND target.month = source.month
  AND target.day = source.day
  AND target.entity_guid = source.entity_guid
)
WHEN MATCHED AND (
.......further conditions
The idea was to use the creation date to narrow the search space before the GUID comparison is evaluated, if that makes sense. So far I have not been able to make any improvement in the merge time.
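To be concrete, what I had in mind is something like the variant below, where the partition predicates are literals computed from the staging data up front, so the optimizer has a chance to prune partitions instead of only comparing column to column. The literal values are placeholders, and I am not certain this pruning actually kicks in, which is partly what I am asking:

%sql
-- Sketch: keep the column-to-column join, but add literal partition filters
MERGE INTO crm_entity_aligned AS target
USING stg_entity AS source
ON  target.year = source.year
AND target.month = source.month
AND target.day = source.day
AND target.year IN (2023, 2024)   -- placeholder literals, would be computed from stg_entity first
AND target.entity_guid = source.entity_guid
WHEN MATCHED THEN UPDATE SET *    -- simplified; my real statement has further conditions
WHEN NOT MATCHED THEN INSERT *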
Has anyone else found a better working approach to this issue? Is there another way to structure the ON/match condition of the merge statement? Is there an alternative partitioning approach, or some other way to get the merge operation back to a reasonable time, say 3 minutes?
Any answers appreciated... Richard