
I'm looking for an efficient way of detecting deleted records in production and updating the data warehouse to reflect those deletes, because the table has more than 12 million rows and contains transactional data used for accounting purposes.

Originally, everything was done in a stored procedure by somebody before me and I've been tasked with moving the process to SSIS.

Here is what my test pattern looks like so far:

[screenshot of the SSIS control flow]

Inside the Data Flow Task:

[screenshot of the data flow]

I'm using MD5 hashes to speed up the ETL process as demonstrated in this article.

This should give the process a large speed boost, since far fewer rows need to be held in memory for comparison and most of the conditional-split work is eliminated.
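
For illustration, a hash column like this can be produced in T-SQL with HASHBYTES; the sketch below is only an example of the idea, and the table and column names (dbo.SourceTransaction, TransactionId, Amount, AccountId, PostedDate) are placeholders, not the real schema.

    -- Sketch only: compute an MD5 hash over the non-key columns so the load
    -- can compare one value per row instead of every column.
    -- Object and column names are placeholders.
    SELECT
        TransactionId,
        HASHBYTES('MD5',
            CONCAT(Amount, '|', AccountId, '|', PostedDate)) AS HashValue
    FROM dbo.SourceTransaction;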

But the issue is it doesn't account for records that are deleted in production.

How should I go about doing this? It may be simple to you, but I'm new to SSIS, so I'm not sure how to ask correctly.

Thank you in advance.

Jonathan Porter
    I can guarantee that this will be waaaay slower. Most of those are slow blocking tasks. Have you tested it yet? You could move it all to SSIS by just calling the stored procedure from SSIS – Nick.Mc Dec 09 '16 at 15:03
    There are two basic patterns to detect differences in tables: 1. Compare the entire tables based on a unique key; 2. Write all changes to a log when they occur (i.e. triggers or CDC) and only apply the changes – Nick.Mc Dec 09 '16 at 15:07
  • @Nick.McDermaid Thanks for your comments! I did test it and it's a bit slower during the initial load. Since I made this for incremental loads, like you said, this pattern will detect differences using (in my case) the ID as a unique value and the HashValue to determine modified data. I don't have access to CDC (no enterprise for me) but do you know if triggers would be faster for incremental loading many rows vs this? Thanks again. – Jonathan Porter Dec 09 '16 at 20:21
  • Triggers are a manual way of implementing CDC. You need to create triggers and log tables in the source, which can often be an issue with vendors. I'm surprised that you had comparable performance, as this is a row-by-row approach rather than a batch approach. As you begin your SSIS journey, it's important to understand the differences between an 'SSIS task' approach and an 'ELT' approach (where transformation happens with SQL statements or stored procedures). – Nick.Mc Dec 09 '16 at 23:10
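
To illustrate the trigger-based variant of the change-logging pattern mentioned in these comments, here is a minimal sketch of a delete-logging trigger on the source table; all object names (dbo.SourceTransaction, etl.TransactionDeleteLog) are hypothetical, and a real design would need to cover inserts and updates as well.

    -- Sketch only: log deleted keys at the source so the warehouse load can
    -- apply just the logged deletes. Object names are placeholders.
    CREATE TABLE etl.TransactionDeleteLog (
        TransactionId INT          NOT NULL,
        DeletedAtUtc  DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME()
    );
    GO

    CREATE TRIGGER trg_SourceTransaction_Delete
    ON dbo.SourceTransaction
    AFTER DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- The "deleted" pseudo-table holds every row removed by the statement.
        INSERT INTO etl.TransactionDeleteLog (TransactionId)
        SELECT d.TransactionId
        FROM deleted AS d;
    END;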

2 Answers


The solution I ended up using was to add another Data Flow Task and use the Lookup transformation to find fact-table records that no longer exist in production. This task comes after all of the inserts and updates shown in my question above.

Then we can batch-delete those missing records in an Execute SQL Task.

Inside Data Flow Task:

[screenshot of the data flow]

Inside Lookup Transformation:

(note the Redirect rows to no match output)

[screenshots of the Lookup transformation configuration]

So, if the IDs don't match, those rows are redirected to the no-match output, which we point at our staging table. Then we join staging to the fact table and apply the deletions inside an Execute SQL Task, as shown below.

[screenshot of the Execute SQL Task]
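
The statement inside that Execute SQL Task would look roughly like the sketch below; the staging and fact table names (etl.DeletedKeys, dbo.FactTable) and the key column are placeholders for the real ones.

    -- Sketch only: batch-delete fact rows whose keys landed in the no-match
    -- staging table. Object names are placeholders, not the real schema.
    DELETE f
    FROM dbo.FactTable AS f
    INNER JOIN etl.DeletedKeys AS d
        ON d.TransactionId = f.TransactionId;

    -- Empty the staging table so the next incremental run starts clean.
    TRUNCATE TABLE etl.DeletedKeys;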

Jonathan Porter

I think you'll need to adapt your data flow to use a Merge Join instead of a Lookup. That way you can see what's new, changed, and deleted. You'll need to sort both flows by the same joining key (in this case your hash column).

Personally, I'm not sure I'd bother; instead, I'd simply stage all my prod data and then do a three-way SQL MERGE statement to handle inserts, updates, and deletes in one pass. You can keep your hash column as a joining key if you like.
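
As a rough illustration of that three-way MERGE (all object names below are placeholders; here the business key drives the join and the hash column only flags changed rows):

    -- Sketch only: one MERGE handles inserts, updates, and deletes in a single
    -- pass. Staging holds the full production extract; names are placeholders.
    MERGE dbo.FactTable AS tgt
    USING etl.StagedTransaction AS src
        ON src.TransactionId = tgt.TransactionId
    WHEN MATCHED AND src.HashValue <> tgt.HashValue THEN
        UPDATE SET tgt.Amount    = src.Amount,
                   tgt.AccountId = src.AccountId,
                   tgt.HashValue = src.HashValue
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (TransactionId, Amount, AccountId, HashValue)
        VALUES (src.TransactionId, src.Amount, src.AccountId, src.HashValue)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;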

john McTighe
  • Hi @john thanks for your answer. After doing research I found that the merge method might be much slower especially if I handled inserts, updates and deletes in one pass as you mentioned vs if I did those separately. Do you have any resources that compare your method to something similar to mine? Reference: http://www.sqlservercentral.com/Forums/Topic1465931-391-1.aspx – Jonathan Porter Dec 09 '16 at 20:30
  • @JonathanPorter - Keep in mind that in that thread they're comparing the performance of MERGE to *batch* UPDATE/INSERT/DELETE, not row-by-row like you're doing. As someone who very heavily uses MERGE in ELT processes, it's very rare that I run into performance problems with them - usually I find them worthwhile in terms of maintainability, how quick they are to write vs. using SSIS or separate statements, usefulness of the OUTPUT clause, etc. But I also don't use them to do DELETEs very frequently, so can't vouch for them on that front. – Jo Douglass Dec 12 '16 at 13:15