I have a simple Delta Lake table (my_table) that has three columns:
- col1 - the "primary key"
- col2
- col3
I'm attempting to construct a merge call that accomplishes the following:
- The Delta Lake transaction log does not get modified if there is no change in the source table (when compared to the target)
- A record is added to the target table if it is present in the source but not the target
- A record is updated in the target table from the source table if it is present in both the source and target tables AND (col2 or col3 have different values)
- A record is deleted from the target table if it is not present in the source but is present in the target
I'm successful with requirements #1 through #3. I'm having problems with requirement #4. I believe I can solve #4 by using a subquery for the condition in the whenNotMatchedBySourceDelete
function within a merge call. However, it is not working.
DeltaTable.forName(spark_session, 'my_table').alias('dest').merge(
source=source_dataframe.alias('updates'),
condition='dest.col1 = updates.col1 AND (dest.col2 != updates.col2 OR dest.col3 != updates.col3)')
.whenMatchedUpdateAll()
.whenNotMatchedBySourceDelete('dest.col1 NOT IN (SELECT col1 FROM updates)')
.execute()
The above code gives me an AnalysisException: Failed to resolve
error. Specifically, it complains about UnresolvedRelation [updates], [], false
.
What am I doing wrong?