I have built a Data Factory pipeline that ETLs data from a Data Lake into a data warehouse. I chose SCD Type 1 for my dimensions.
My pipeline contains the following activities:
- [Stored Procedure] Clear staging tables;
- [Stored Procedure] Get the timestamp of the last successful update;
- [U-SQL] Extract the dimension data from the filtered files (the ones that have been modified since the last successful update) in Azure Data Lake, transform it, and output it as a CSV file;
- [Copy Data] Load the CSV into a SQL Data Warehouse staging dimension table;
- [Stored Procedure] Merge the data from the staging table into the production dimension table (see the sketch after this list);
- [U-SQL] Extract the fact data from the files (the ones that have been modified since the last successful update) in Azure Data Lake, transform it, and output it as a CSV file;
- [Copy Data] Load the CSV into a SQL Data Warehouse fact table;
- [Stored Procedure] Update the timestamp of the successful update.
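
For context, the dimension merge follows the usual SCD Type 1 pattern. A simplified sketch of that stored procedure (placeholder names such as `dbo.DimCustomer`, `stg.DimCustomer` and `CustomerBusinessKey`, not my real schema) looks like this; since `MERGE` is not available to me, it is an `UPDATE` followed by an `INSERT`:

```sql
-- SCD Type 1 upsert from staging into the production dimension (placeholder names).

-- 1. Overwrite the attributes of dimension rows that already exist.
UPDATE dbo.DimCustomer
SET    CustomerName = s.CustomerName,
       City         = s.City
FROM   stg.DimCustomer AS s
WHERE  dbo.DimCustomer.CustomerBusinessKey = s.CustomerBusinessKey;

-- 2. Insert the staged rows that are new to the dimension (anti-join instead of NOT EXISTS).
INSERT INTO dbo.DimCustomer (CustomerBusinessKey, CustomerName, City)
SELECT s.CustomerBusinessKey, s.CustomerName, s.City
FROM   stg.DimCustomer AS s
LEFT JOIN dbo.DimCustomer AS d
       ON d.CustomerBusinessKey = s.CustomerBusinessKey
WHERE  d.CustomerBusinessKey IS NULL;
```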
The problem with this pipeline is that I end up with duplicated fact entries in my warehouse if I run the pipeline twice.
Question
How can I efficiently prevent duplicate rows in my fact table, considering all the unsupported features in Azure SQL Data Warehouse?
Update
I have since read more about the indexing (and the statistics) of a warehouse and how they must be rebuilt after an update.
Considering that, the simplest thing I thought of was to apply the same principle to the facts as the one I use for the dimensions: load all the new facts into a staging table, then use the fact table's index (its key columns) to insert only the facts that do not already exist (the facts can't be updated right now).
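
A minimal sketch of what I have in mind, with placeholder names (`dbo.FactSales` as the production fact table, `stg.FactSales` as the staging table, and `SalesOrderNumber` + `SalesOrderLineNumber` as the natural key of a fact row), using an anti-join so it only relies on features I know are supported:

```sql
-- Insert only the staged fact rows that are not already in the production fact table.
INSERT INTO dbo.FactSales (SalesOrderNumber, SalesOrderLineNumber, CustomerKey, Amount)
SELECT s.SalesOrderNumber, s.SalesOrderLineNumber, s.CustomerKey, s.Amount
FROM   stg.FactSales AS s
LEFT JOIN dbo.FactSales AS f
       ON  f.SalesOrderNumber     = s.SalesOrderNumber
       AND f.SalesOrderLineNumber = s.SalesOrderLineNumber
WHERE  f.SalesOrderNumber IS NULL;  -- keep only rows with no match in the fact table
```

The anti-join keyed on the fact table's index columns should make a second run of the pipeline a no-op for rows that were already loaded. Is this a reasonable approach, or is there a more efficient way to do it?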