This happens from time to time, which is the strange part. My current solution:
Re-run the job! :disappointed: But this is very reactive, and I'm not happy with it.
This is what my merge statement looks like:
MERGE INTO target_tbl AS Target
USING df_source AS Source
ON Source.key = Target.key
WHEN MATCHED
  AND Target.ctl_utc_dts = '9999-12-31'
  AND Target.ctl_hash = Source.ctl_hash
  AND Source.ctl_start_utc_dts < Target.ctl_start_utc_dts THEN
  UPDATE SET
    Target.ctl_start_utc_dts = Source.ctl_start_utc_dts,
    Target.ctl_updated_run_id = Source.ctl_updated_run_id,
    Target.ctl_modified_utc_dts = Source.ctl_modified_utc_dts,
    Target.ctl_updated_batch_id = Source.ctl_updated_batch_id
WHEN MATCHED
  AND Target.ctl_utc_dts = '9999-12-31'
  AND Target.ctl_hash != Source.ctl_hash THEN
  UPDATE SET
    Target.ctl_utc_dts = Source.ctl_start_utc_dts,
    Target.ctl_updated_run_id = Source.ctl_updated_run_id,
    Target.ctl_modified_utc_dts = Source.ctl_modified_utc_dts,
    Target.ctl_updated_batch_id = Source.ctl_updated_batch_id
WHEN NOT MATCHED THEN
  INSERT (columns......)
  VALUES (columns......)
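In case it helps to see, the same merge written with the DeltaTable Python API looks roughly like this. This is a minimal sketch, not my actual code: the table path is a placeholder, and whenNotMatchedInsertAll stands in for the explicit INSERT(columns...) VALUES(columns...) column list.

from delta.tables import DeltaTable

# Placeholder path; the real target is the Delta table behind target_tbl
target = DeltaTable.forPath(spark, "s3://s3_bkt/path/to/target_tbl")

(target.alias("Target")
    .merge(df_source.alias("Source"), "Source.key = Target.key")
    # Same record, earlier start date: pull the start date back
    .whenMatchedUpdate(
        condition="""Target.ctl_utc_dts = '9999-12-31'
                     AND Target.ctl_hash = Source.ctl_hash
                     AND Source.ctl_start_utc_dts < Target.ctl_start_utc_dts""",
        set={
            "ctl_start_utc_dts": "Source.ctl_start_utc_dts",
            "ctl_updated_run_id": "Source.ctl_updated_run_id",
            "ctl_modified_utc_dts": "Source.ctl_modified_utc_dts",
            "ctl_updated_batch_id": "Source.ctl_updated_batch_id",
        })
    # Changed record: close out the current row
    .whenMatchedUpdate(
        condition="""Target.ctl_utc_dts = '9999-12-31'
                     AND Target.ctl_hash != Source.ctl_hash""",
        set={
            "ctl_utc_dts": "Source.ctl_start_utc_dts",
            "ctl_updated_run_id": "Source.ctl_updated_run_id",
            "ctl_modified_utc_dts": "Source.ctl_modified_utc_dts",
            "ctl_updated_batch_id": "Source.ctl_updated_batch_id",
        })
    # Stand-in for the explicit INSERT(columns...) VALUES(columns...)
    .whenNotMatchedInsertAll()
    .execute())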
Spark App configuration:
--conf spark.yarn.stagingDir=hdfs://$(hostname -f):8020/user/hadoop
--conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark
--conf spark.yarn.submit.waitAppCompletion=true
--conf spark.yarn.maxAppAttempts=5
--conf yarn.resourcemanager.am.max-attempts=5
--conf spark.shuffle.service.enabled=true
--executor-memory 24G
--driver-memory 60G
--driver-cores 6
--executor-cores 4
--conf spark.executor.asyncEagerFileSystemInit.paths=s3://s3_bkt
--conf spark.dynamicAllocation.maxExecutors=24
--packages io.delta:delta-core_2.12:1.0.0
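For completeness, the SparkSession setup assumed by the merge above looks roughly like this. A minimal sketch only: the app name is made up, and the two Delta settings are the ones the delta-core 1.0.0 docs call for, whether they are set here or via --conf in spark-submit is up to you.

from pyspark.sql import SparkSession

# Minimal session sketch for delta-core 1.0.0 on Spark 3.1
spark = (SparkSession.builder
    .appName("scd_merge_job")  # hypothetical app name
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())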
Running on AWS EMR
Release label: emr-6.4.0
Hadoop distribution: Amazon 3.2.1
Applications: Spark 3.1.2
Any pointers on things I need to do differently? Any help or ideas would be great! Thanks all.