I am creating a data warehouse using Azure Data Factory to extract data from a MySQL table and save it in parquet format in an ADLS Gen2 filesystem. From there, I use Synapse notebooks to process and load data into destination tables.
The initial load is fairly easy using df.write.saveAsTable('orders'). However, I am running into issues with the incremental loads that follow the initial load. In particular, I have not been able to find a way to reliably insert/update records in an existing Synapse table.
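For context, the initial load is roughly the following (the storage account and path are placeholders):

```python
# Read the full extract that ADF landed in ADLS, then save it as a managed table
df = spark.read.parquet("abfss://raw@<storage-account>.dfs.core.windows.net/orders/full/")
df.write.saveAsTable("orders", format="parquet")
```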
Since Spark does not allow DML operations (UPDATE/DELETE) on a parquet table, I have resorted to reading the current table into a Spark DataFrame and applying the inserts/updates to that DataFrame. However, when I try to save the result with df.write.saveAsTable('orders', mode='overwrite', format='parquet'), I run into the error Cannot overwrite table 'orders' that is also being read from.
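The incremental step looks roughly like this (order_id as the key and the incremental path are placeholders); the final write is what throws the error:

```python
# Existing table plus the new/changed rows that ADF landed in ADLS
current = spark.read.table("orders")
updates = spark.read.parquet("abfss://raw@<storage-account>.dfs.core.windows.net/orders/incremental/")

# Keep the newest version of each order: drop the rows being replaced, then append the updates
merged = current.join(updates, on="order_id", how="left_anti").unionByName(updates)

# This write raises: Cannot overwrite table 'orders' that is also being read from
merged.write.saveAsTable("orders", mode="overwrite", format="parquet")
```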
A solution suggested in this answer is to create a temporary table and then insert from that, but it still results in the error above.
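Roughly what I tried for that, reusing the merged DataFrame from above (orders_staged is just a placeholder name); it fails with the same "also being read from" error:

```python
# Stage the merged result as a temp view and insert it back via SQL
merged.createOrReplaceTempView("orders_staged")
spark.sql("INSERT OVERWRITE TABLE orders SELECT * FROM orders_staged")
```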
Another solution in this post suggests writing the data to a temporary table, dropping the target table, and then renaming the temporary table, but when I do this, Spark gives me FileNotFound errors regarding the table metadata.
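That attempt was along these lines (table names are placeholders):

```python
# Write the merged data to a side table, then swap it in place of the original
merged.write.saveAsTable("orders_new", mode="overwrite", format="parquet")
spark.sql("DROP TABLE orders")
spark.sql("ALTER TABLE orders_new RENAME TO orders")

# Reading 'orders' after the rename is what produces the FileNotFound errors about metadata
spark.read.table("orders").count()
```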
I know Delta tables can fix this issue pretty reliably, but our company is not yet ready to move over to Databricks.
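(For reference, the kind of upsert I have in mind with Delta would be roughly the merge below, reusing the updates DataFrame from above with order_id as a placeholder key, but changing the table format is not an option for us right now.)

```python
from delta.tables import DeltaTable

# Upsert the incremental rows into the target table
target = DeltaTable.forName(spark, "orders")
(target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```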
All suggestions are greatly appreciated.