0

I have a simple Delta Lake table (my_table) that has three columns:

  • col1 - the "primary key"
  • col2
  • col3

I'm attempting to construct a merge call that accomplishes the following:

  1. The Delta Lake transaction log does not get modified if there is no change in the source table (when compared to the target)
  2. A record is added to the target table if it is present in the source but not the target
  3. A record is updated in the target table from the source table if it is present in both the source and target tables AND (col2 or col3 have different values)
  4. A record is deleted from the target table if it is not present in the source but is present in the target

I'm successful with requirements #1 through #3. I'm having problems with requirement #4. I believe I can solve #4 by using a subquery for the condition in the whenNotMatchedBySourceDelete function within a merge call. However, it is not working.

DeltaTable.forName(spark_session, 'my_table').alias('dest').merge(
    source=source_dataframe.alias('updates'),
    condition='dest.col1 = updates.col1 AND (dest.col2 != updates.col2 OR dest.col3 != updates.col3)')
    .whenMatchedUpdateAll()
    .whenNotMatchedBySourceDelete('dest.col1 NOT IN (SELECT col1 FROM updates)')
    .execute()

The above code gives me an AnalysisException: Failed to resolve error. Specifically, it complains about UnresolvedRelation [updates], [], false.

What am I doing wrong?

zzzz8888
  • 23
  • 3

0 Answers0