0

I have a bunch of U-SQL activities that manipulates & transform data in an Azure Data Lake. Out of this, I get a csv file that contains all my events.

Next I would just use a Copy Data activity to copy the csv file from the Data Lake directly into an Azure SQL Data Warehouse table.

Extract-Transform-Merge-Load

  1. I extract the information from a bunch of JSON files stored in the Data Lake and create a staging .csv file;
  2. I grab the staging .csv file & a production .csv file and inject the latest change (and avoid duplicates) and save the production .csv file;
  3. Copy the .csv production file directly to the Warehouse table.

I realized that my table contains duplicated rows and, after having tested the U-SQL scripts, I assume that the Copy Data activity -somehow- merges the content of the csv file into the table.

Question

I am not convinced I am doing the right thing here. Should I define my warehouse table as an EXTERNAL table that would get its data from the .csv production file? Or should I change my U-SQL to only include the latest changes?

Kzryzstof
  • 7,688
  • 10
  • 61
  • 108

1 Answers1

0

If you want to use external tables depends on your use case. If you want the data to be stored inside SQL DW for better performance, you have to copy it at some point, e.g. via a stored procedure. You could then just call the stored procedure from ADF, for instance.

Or, if you don't want to / cannot filter out data beforehand, you could also implement an "Upsert" stored procedure in your SQL DW and call this to insert your data instead of the copy activity.

silent
  • 14,494
  • 4
  • 46
  • 86
  • Is there a reason why I would prefer using copying the data from the Lake to the Warehouse (with the Copy activity) instead of creating an external table in the warehouse and referencing the data from the Lake? I mean, is one more efficient than the other? – Kzryzstof Apr 12 '19 at 14:01
  • If, for instance, you cannot (or don't want to) configure external tables (inclduing storing the storage account key) or stored procedures inside your SQL DW, a copy from Data Factory could make sense. But performance wise Polybase (External Table) is probably better. – silent Apr 15 '19 at 08:08