
I am using a very simple architecture to copy data from an external source into Azure Data Lake Storage Gen2 and serve it to Power BI via a Serverless SQL pool (where I perform some aggregations).

For the initial load, I used the Copy Data activity (Synapse pipeline) and stored the data in Parquet files.

Since Parquet files in ADLS Gen2 do not support UPDATE operations, I am looking for best practices for building the incremental load (watermarking process) without an additional database to host the control/watermark table and the stored procedure that updates the last run date.

Has anyone bumped into this before? Thanks!

PS: I first checked the best practice here: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-overview

– ciufy13

1 Answer


Synapse Serverless SQL pools can write data to ADLS using the CETAS command. If you load the incremental inserts, updates, or deletes from the source into a Parquet file, you can use CETAS to full outer join the existing table with the incremental changes and create a new table in a new folder. It will likely require a little orchestration with a Synapse pipeline to delete the old folder and rename the new folder at the end of the process. We have used this exact approach with Synapse pipelines and Synapse Serverless SQL CETAS before, and it works well for moderately sized data.
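
A minimal sketch of that CETAS merge, assuming external tables `dbo.Customers_current` (over the existing folder) and `dbo.Customers_increment` (over the incremental extract) already exist in the Serverless database, along with an external data source and Parquet file format; all object and column names here are illustrative, not part of the original answer:

```sql
-- Merge the increment into a new folder via CETAS.
-- Data source / file format objects and column names are assumptions for illustration.
CREATE EXTERNAL TABLE dbo.Customers_v2
WITH (
    LOCATION    = 'curated/customers_v2/',       -- new folder; the pipeline deletes/renames afterwards
    DATA_SOURCE = adls_curated,                  -- hypothetical external data source
    FILE_FORMAT = parquet_format                 -- hypothetical Parquet file format
)
AS
SELECT
    COALESCE(i.CustomerId,   o.CustomerId)   AS CustomerId,
    COALESCE(i.Name,         o.Name)         AS Name,
    COALESCE(i.ModifiedDate, o.ModifiedDate) AS ModifiedDate
FROM dbo.Customers_current AS o                  -- external table over the existing Parquet folder
FULL OUTER JOIN dbo.Customers_increment AS i     -- external table over the incremental extract
    ON o.CustomerId = i.CustomerId
WHERE COALESCE(i.IsDeleted, 0) = 0;              -- drop rows flagged as deleted in the increment
```

The COALESCE pattern means an incremental row wins over the existing row for updates, rows only in the increment are inserted, and rows flagged as deleted are filtered out of the new folder.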

I wouldn’t use it for TB-sized data because you are completely rewriting the table’s Parquet files every day. In that situation I would use Databricks or Synapse Spark compute to maintain Delta Lake tables as the final storage in ADLS, since Delta Lake supports updates. Synapse Serverless SQL can still query Delta Lake tables.
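
For reference, a minimal sketch of querying a Delta Lake folder from Serverless SQL with OPENROWSET; the storage account and folder path are placeholders, not values from the original thread:

```sql
-- Query a Delta Lake table directly from Synapse Serverless SQL.
-- Point BULK at the Delta table's root folder; URL and path are placeholders.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/curated/customers_delta/',
    FORMAT = 'DELTA'
) AS rows;
```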

– GregGalloway
  • Thanks Greg for your insightful answer. I guess my issue is not around merging the incremental loads, but how you keep that control table (and where) to identify what the next run should copy into ADLS. So how did you produce those "watermarks" in your implementation? – ciufy13 Jul 12 '22 at 11:34
  • @ElenaM we often spin up a small Serverless tier Azure SQL DB which auto-terminates after an hour for this. It works well if you are just running the loads a few times a day. But I think for this project we just stored the watermarks in an on-prem SQL database on the same server as the source. I suppose you could store this watermark in files in the data lake, but that’s not typical (a sketch of a lake-only approach follows below). – GregGalloway Jul 12 '22 at 13:56
  • Awesome, I wanted to first check if I am missing any options before considering an SQL DB. Thanks for your support! :) – ciufy13 Jul 13 '22 at 09:41
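
As a lake-only variation on the file-based option mentioned above, one hedged sketch is to derive the watermark from the data already written, rather than keeping a separate control table: a pipeline Lookup activity runs the query below against Serverless SQL and passes the result to the next Copy Data run as the lower bound of the extract. The storage URL and the `ModifiedDate` column are assumptions for illustration:

```sql
-- Compute the current high-watermark from the Parquet files already in the lake.
-- Storage URL and ModifiedDate column are assumptions; a Lookup activity can
-- run this and feed the value to the next incremental Copy Data activity.
SELECT MAX(ModifiedDate) AS LastWatermark
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/curated/customers_current/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
```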