
I'm after some advice. I'm using SSIS/SQL Server 2014. I have a nightly SSIS package that pulls data from non-SQL Server databases into a single table (the SQL table is truncated beforehand each time), and I then extract from this table to create a daily CSV file.

Going forward, I only want to extract to CSV on a daily basis the records that have changed, i.e. the deltas.

What is the best approach? I was thinking of using CDC in SSIS, but as I'm truncating the SQL table before each load, will that be the best method? Or will I need a master table in SQL with an initial load, then import into another table and extract only the rows that differ? For info, the table in SQL contains a primary key.

I just want to double-check, as CDC assumes the tables are all in SQL Server, whereas my data comes from outside SQL Server first.

Thanks for any help.

Michael

1 Answer


The primary key on that table is your saving grace here. Obviously enough, the SQL Server database that you're pulling the disparate data into won't know from one table flush to the next which records have changed, but if you add two additional tables, and modify the existing table with an additional column, it should be able to figure it out by leveraging HASHBYTES.

For this example, I'll call the new tables SentRows and RowsToSend, but you can use more meaningful names in practice. We'll call the new column in the old table HashValue.

Add the column HashValue to your table as a varbinary data type, NOT NULL as well.

Create your SentRows table with columns for all the columns in the main table's primary key, plus the HashValue column.

Create a RowsToSend table that's structurally identical to your main table, including the HashValue.
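As a rough sketch of that setup, assuming the main table is called MainTable (as in the synopsis below), with a single-column primary key Id and with Col1/Col2 standing in for your real non-key columns, the DDL might look something like this:

```sql
-- Run while MainTable is empty (it's truncated nightly), so the
-- NOT NULL column can be added without a default.
-- varbinary(32) is wide enough for a SHA2_256 hash.
ALTER TABLE dbo.MainTable
    ADD HashValue varbinary(32) NOT NULL;

-- Keys and hashes of everything already sent.
CREATE TABLE dbo.SentRows
(
    Id        int           NOT NULL PRIMARY KEY,  -- your PK column(s)
    HashValue varbinary(32) NOT NULL
);

-- Structurally identical to the main table, including HashValue.
CREATE TABLE dbo.RowsToSend
(
    Id        int           NOT NULL PRIMARY KEY,
    Col1      varchar(50)   NULL,  -- stand-ins for your real columns
    Col2      varchar(50)   NULL,
    HashValue varbinary(32) NOT NULL
);
```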

Modify your queries to create the HashValue by applying HASHBYTES to all of the non-key columns in the table. (This will be horribly tedious. Sorry about that.)
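For example (Id, Col1, Col2 and StagingSource are stand-in names; note that on SQL Server 2014, HASHBYTES input is capped at 8,000 bytes, a limit that wasn't lifted until 2016):

```sql
INSERT INTO dbo.MainTable (Id, Col1, Col2, HashValue)
SELECT  src.Id,
        src.Col1,
        src.Col2,
        -- Delimit the columns so ('ab','c') and ('a','bc') don't
        -- hash identically; CONCAT also treats NULL as ''.
        HASHBYTES('SHA2_256', CONCAT(src.Col1, '|', src.Col2))
FROM    dbo.StagingSource AS src;
```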

Send out your full data set.

Now move all of the key values and HashValues to the SentRows table. Truncate your main table.
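For the first run, that one-time seeding could be as simple as:

```sql
-- One-time seeding after the initial full send.
INSERT INTO dbo.SentRows (Id, HashValue)
SELECT  Id, HashValue
FROM    dbo.MainTable;

TRUNCATE TABLE dbo.MainTable;
```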

On the next pull, compare the key values and HashValues from SentRows to the new data in the main table.

Primary key match + hash match = Unchanged row

Primary key match + hash mismatch = Updated row

Primary key in incoming data but missing from existing data set = New row

Primary key not in incoming data but in existing data set = Deleted row
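In T-SQL, one way to express all four cases at once is a FULL OUTER JOIN (again assuming a single key column Id):

```sql
SELECT  COALESCE(m.Id, s.Id) AS Id,
        CASE
            WHEN s.Id IS NULL               THEN 'New'
            WHEN m.Id IS NULL               THEN 'Deleted'
            WHEN m.HashValue <> s.HashValue THEN 'Updated'
            ELSE 'Unchanged'
        END AS RowStatus
FROM    dbo.MainTable AS m
FULL OUTER JOIN dbo.SentRows AS s
        ON s.Id = m.Id;
```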

Pull out any changes you need to send to the RowsToSend table.
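Something along these lines (explicit column lists are safer than SELECT * if the tables ever drift apart):

```sql
INSERT INTO dbo.RowsToSend (Id, Col1, Col2, HashValue)
SELECT  m.Id, m.Col1, m.Col2, m.HashValue
FROM    dbo.MainTable AS m
LEFT JOIN dbo.SentRows AS s
        ON s.Id = m.Id
WHERE   s.Id IS NULL                 -- new rows
   OR   s.HashValue <> m.HashValue;  -- updated rows
```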

Send the changes from RowsToSend.

Move the key values and HashValues to your SentRows table. Update hashes for changed key values, insert new rows, and decide how you're going to handle deletes, if you have to deal with deletes.
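A MERGE covers the update/insert half; one possible way to handle deletes, if all you need is to stop tracking vanished keys, is to prune SentRows against the main table in the same batch:

```sql
MERGE dbo.SentRows AS tgt
USING dbo.RowsToSend AS src
   ON tgt.Id = src.Id
WHEN MATCHED THEN                  -- changed key values
    UPDATE SET tgt.HashValue = src.HashValue
WHEN NOT MATCHED BY TARGET THEN    -- new rows
    INSERT (Id, HashValue)
    VALUES (src.Id, src.HashValue);

-- One option for deletes: drop keys that no longer exist in the
-- source. Capture them first if you have to send delete records.
DELETE s
FROM   dbo.SentRows AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.MainTable AS m
                   WHERE  m.Id = s.Id);
```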

Truncate the RowsToSend table to get ready for tomorrow.

If you'd like (and you'll thank yourself later if you do), add a column to the SentRows table with a default of GETDATE(), which will tell you when the row was added.
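For instance (SentDate is a made-up name; because the column is NOT NULL, SQL Server backfills existing rows with the default):

```sql
ALTER TABLE dbo.SentRows
    ADD SentDate datetime NOT NULL
        CONSTRAINT DF_SentRows_SentDate DEFAULT (GETDATE());
```

Bear in mind the default only fires on INSERT; rows touched by the MERGE above keep their original SentDate unless you update it explicitly.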

And away you go. Nothing but deltas from now on.

Edit 2019-10-31:

Step by step (or TL;DR):

1) Flush and Fill MainTable.

2) Compare keys and hashes on MainTable to keys and hashes on SentRows to identify new/changed rows.

3) Move new/changed rows to RowsToSend.

4) Send the rows that are in RowsToSend.

5) Move all the rows from RowsToSend to SentRows.

6) Truncate RowsToSend.

Eric Brandt
  • Glad it helped! It's a pattern I'm painfully familiar with. :) – Eric Brandt Sep 26 '19 at 15:30
  • I'm not sure I have followed this 100%. My SentRows table is getting truncated and loaded each day, so the GETDATE column is always today's date - I never get a complete history of what was actually sent – Michael Oct 31 '19 at 11:32
  • I think it's the step: Move the key values and HashValues to your SentRows table. Update hashes for changed key values, insert new rows, and decide how you're going to handle deletes, if you have to deal with deletes. I'm just truncating the sent rows and then copying from the main table – Michael Oct 31 '19 at 11:47
  • I added a synopsis. It sounds like you're missing a table in the flow. – Eric Brandt Oct 31 '19 at 13:19
  • Thank you, I will take a look. Thanks again. – Michael Oct 31 '19 at 13:52
  • What is the best way to handle deletes? I seem to be getting them duplicated in the SentRows and thus keep sending them on each run – Michael Nov 01 '19 at 11:24
  • In the context of the answer, a `delete` is a record that was in your source system yesterday, but is gone today. It sounds like you're talking about something different from that. It also sounds like we've moved past the initial question, and now we're talking about fine tuning your specific implementation of the sort of broad strokes in the answer. I don't know enough detail about what that looks like to be helpful, especially in the context of the comments section. – Eric Brandt Nov 01 '19 at 13:08
  • If you're still stuck, it may be worth your time to put together a new question that includes the details of what you've built so far. These two posts have a lot of good resources to help you put that together. [How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) and [How to post a T-SQL question on a public forum](https://spaghettidba.com/2015/04/24/how-to-post-a-t-sql-question-on-a-public-forum/) – Eric Brandt Nov 01 '19 at 13:08
  • I've worked it out. I have created an additional step between 4 and 5: delete from SentRows the rows matching in RowsToSend. This ensures the record in SentRows is unique and only ever the most recent delta change – Michael Nov 01 '19 at 14:14
  • Thanks again though for your help - I owe you a beer!! – Michael Nov 01 '19 at 14:16
  • Next time I'm in the UK, I'll give you a ring. :) – Eric Brandt Nov 01 '19 at 14:26