
I have built a Data Factory pipeline that ETLs data from a Data Lake into a data warehouse. I chose SCD type 1 for my dimensions.

My pipeline contains the following activities:

  1. [Stored Procedure] Clear staging tables;
  2. [Stored Procedure] Get the timestamp of the last successful update (the watermark pattern is sketched after this list);
  3. [U-SQL] Extract the dimension data from the filtered files (the ones that have been modified since the last successful update) in Azure Data Lake, transform it and output it in a csv file;
  4. [Copy Data] Load the csv into a SQL datawarehouse staging dimension table;
  5. [Stored Procedure] Merge the data from the staging table into the production table;
  6. [U-SQL] Extract the fact data from the files (the ones that have been modified since the last successful update) in Azure Data Lake, transform it and output it in a csv file;
  7. [Copy Data] Load the csv into a SQL datawarehouse fact table;
  8. [Stored Procedure] Update the timestamp of the successful update.
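
For reference, steps 2 and 8 boil down to a small watermark table and two stored procedures; a minimal sketch follows (the schema, table and procedure names are placeholders, not my actual objects):

-- Watermark table holding the last successful run time per pipeline (names are placeholders).
create table etl.WatermarkLog
(
    PipelineName    varchar(100) not null,
    LastSuccessTime datetime2    not null
);
go

-- Step 2: read the timestamp of the last successful update.
create procedure etl.GetLastSuccessTime @PipelineName varchar(100)
as
    select LastSuccessTime
    from   etl.WatermarkLog
    where  PipelineName = @PipelineName;
go

-- Step 8: move the watermark forward once the run has succeeded.
create procedure etl.SetLastSuccessTime @PipelineName varchar(100), @RunTime datetime2
as
    update etl.WatermarkLog
    set    LastSuccessTime = @RunTime
    where  PipelineName = @PipelineName;
go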

The problem with this pipeline is that I end up with duplicated fact entries in my warehouse if I run the pipeline twice.

Question

How can I efficiently prevent duplicated rows in my facts table, considering all the unsupported features in Azure SQL Data Warehouse?

Update

I have also read about the indexing (and the statistics) of a warehouse and how they must be rebuilt after an update.

Considering that, the simplest approach I can think of is to apply the same principle to the facts that I use for the dimensions: load all the new facts into a staging table, then use an index on the fact table to insert only the facts that do not already exist (the facts cannot be updated right now).
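
A minimal sketch of that idea, with placeholder table and column names (my real fact table has more columns):

-- Insert only the staged facts whose business key is not already in the fact table.
insert into dbo.FactSales (SaleBk, DateKey, CustomerKey, Amount)
select  s.SaleBk,
        s.DateKey,
        s.CustomerKey,
        s.Amount
from    stg.FactSales s
where   not exists (
            select 1
            from   dbo.FactSales f
            where  f.SaleBk = s.SaleBk);

-- Rebuild statistics after the load, as mentioned above.
update statistics dbo.FactSales;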

Kzryzstof

1 Answer


Do the lifting in Azure SQL Data Warehouse ... your performance will improve dramatically, and your problem will go away.

How many rows are in your filtered files? If it is in the millions to tens of millions, I think you can probably avoid the filter at the data lake stage. The performance of Polybase + SQL should overcome the additional data volume.
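
For context, "Polybase + SQL" here means exposing the lake files as an external table and landing them in the warehouse with CTAS. A rough sketch (the external data source, file format, schema and column names are assumptions you would replace with your own):

-- External table over the files in the data lake (the data source and file format are assumed to exist already).
create external table ext.stage_dimension
(
    bk        varchar(50),
    col1      varchar(100),
    col2      varchar(100),
    timestamp datetime2
)
with (
    location    = '/dimension/',
    data_source = MyDataLake,
    file_format = MyCsvFormat
);

-- CTAS into a hash-distributed staging table; Polybase reads the files in parallel.
create table stage with (heap, distribution = hash(bk)) as
select * from ext.stage_dimension;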

If you can avoid the filter, use this logic and throw away the U-SQL processing:

  • Ingest files to staging table with suitable hash distribution
  • Take the latest version of each row (suitable for SCD1)
  • Merge stage to fact using a query like this:

BK = Business Key column/s. COLn = non-key columns

-- Get latest row for each business key to eliminate duplicates.

create table stage2 with (heap, distribution = hash(bk)) as
select  bk,
        col1,
        col2
from    (
        select  bk,
                col1,
                col2,
                row_number() over (partition by bk order by timestamp desc) rownum
        from    stage
        ) x
where   rownum = 1;

-- Merge the stage into a copy of the dimension

create table dimension_copy with (heap,distribution=replicate) as

select    s.bk,
          s.col1,
          s.col2
from      stage2 s
where     not exists (
              select  1
              from    dimension d
              where   d.bk = s.bk)

union

select   d.bk,
         case when s.bk is null then d.col1 else s.col1 end,
         case when s.bk is null then d.col2 else s.col2 end
from     dimension d
         left outer join stage2 s on s.bk = d.bk;

-- Switch the merged copy with the original 

alter table dimension_copy switch to dimension with (truncate_target=on);

-- Force distribution of replicated table across nodes

select top 1 * from dimension;
Ron Dunn
  • "If it is in the millions to tens of millions, I think you can probably avoid the filter at the data lake stage" Do you mean that I should consider loading **all** the facts every time in the Warehouse and not just a delta? I do not have problems with the Dimensions... only with the Facts that will grow for sure. And after reading your answer, I am not sure which case (Dim vs Fact) we are talking about. – Kzryzstof Apr 09 '19 at 11:49
  • Your original question was talking about dimensions. I would make a decision about facts based on how they were delivered to the data lake in the first place, and the number of rows in each delivery. There isn't enough space in comments to go into the nuances of the decision. You're welcome to email me ... put a dot between my name, and @microsoft.com on the end ... and we can discuss in more detail. – Ron Dunn Apr 09 '19 at 22:14
  • It is true that I mentioned dimensions in the description. However my question always was about the facts table... – Kzryzstof Apr 10 '19 at 12:22