
Context:

Delta Lake allows developers to merge data into a table using a merge statement. I am using Delta Lake's Change Data Feed feature to determine whether I want to insert, update, or delete a certain row. This is determined by the value of the column _change_type, which is created by Delta Lake.

The Problems:

It is easy to use the _change_type column inside the merge condition to determine what to do with each row. The problem is that if I use this column inside the merge condition, it gets inserted into my target table along with the rest of the data. I don't want my target table to contain this _change_type column, as it is useless there and can create conflicts if I decide to enable Change Data Feed on the target table too.

Additional Notes:

I am aware that it is possible to solve this issue by fixing the schema of the target table; however, I need to support schema evolution on the target table, so I can't do this. Ideally, I would be able to drop the _change_type column after checking the merge condition, but before actually merging into the table. Is something like this possible?
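For illustration, here is a minimal pure-Python sketch of the behavior I'm after (hypothetical data; this is not the Delta API). The _change_type column drives the action taken for each row, but the column itself is dropped before the row lands in the target:

```python
# Target table keyed by id; CDF rows carry a _change_type metadata column.
target = {1: {"id": 1, "value": "a"}}

cdf_rows = [
    {"id": 2, "value": "b", "_change_type": "insert"},
    {"id": 1, "value": "a2", "_change_type": "update_postimage"},
]

for row in cdf_rows:
    change = row["_change_type"]
    # Drop the metadata column so it never reaches the target table.
    data = {k: v for k, v in row.items() if k != "_change_type"}
    if change in ("insert", "update_postimage"):
        target[row["id"]] = data
    elif change == "delete":
        target.pop(row["id"], None)

# target now holds only data columns, with no _change_type anywhere.
```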

1 Answer


I derive from your question that:

  • You are working with the Python API for Spark i.e. PySpark.
  • You are working within a Delta Lake, and therefore using the DeltaTable class from the delta.table API.
  • You are trying to update a table through a merge statement.

Under these assumptions, this is the misconception underlying your question:

The problem is that if I use this column inside the merge condition it will be inserted into my target table, along with the rest of the data.

If you check the official Delta docs, under the section Upsert into a table using merge, you will notice the following clauses on the merge call:

  • .merge(merge_condition)
  • .whenMatchedUpdate(update_condition)
  • .whenNotMatchedInsert(insert_condition)

The key here is that the merge condition can be based on one field (i.e. your _change_type column), but your update or insert conditions can be based on totally different fields. Therefore, you just need to keep your merge condition as is, but make sure to add/change your update and insert conditions to include only the fields you want to update/insert.
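As a sketch of what I mean, assuming your source DataFrame exposes data columns such as id and value alongside the Delta CDF metadata columns (column names here are hypothetical), you could build the update/insert maps from the data columns only, so _change_type never appears in them:

```python
# Columns of a hypothetical CDF source DataFrame (e.g. source_df.columns).
source_columns = ["id", "value", "_change_type", "_commit_version", "_commit_timestamp"]

# Delta CDF metadata columns that should never be written to the target.
cdf_metadata = {"_change_type", "_commit_version", "_commit_timestamp"}
data_cols = [c for c in source_columns if c not in cdf_metadata]

# Column map for the update/insert clauses, referencing the merge alias "source".
set_clause = {c: f"source.{c}" for c in data_cols}
```

You would then pass set_clause to .whenMatchedUpdate(set=set_clause) and .whenNotMatchedInsert(values=set_clause), while keeping source._change_type only inside the merge and clause conditions. Because the map is built from the source's column list, newly arriving columns are picked up automatically.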

Let me know if this works for you.

Sergio
  • Thanks a lot for the reply, and sorry for taking so long to get back to you. The problem with that approach is that I don't know beforehand all of the columns I want to use in the update/insert. Since I'm auto-merging the schema, I accept that new columns may arrive from the source system without prior notice, and I want to insert them automatically. In the end this stopped being a problem because we decided to forgo the CDF and simply perform a full overwrite on each table update. It was not a problem performance-wise and it solved the problem, so that's what we did. – Alexandre Carqueja Jun 05 '23 at 12:53