SSIS data flow - copy new data or update existing

Question

I queried some data from table A(Source) based on certain condition and insert into temp table(Destination) before upsert into Crm.

If data already exist in Crm I dont want to query the data from table A and insert into temp table(I want this table to be empty) unless there is an update in that data or new data was created. So basically I want to query only new data or if there any modified data from table A which already existed in Crm. At the moment my data flow is like this.

clear temp table - delete sql statement
Query from source table A and insert into temp table.
From temp table insert into CRM using script component.

In source table A I have audit columns: createdOn and modifiedOn.

I found one way to do this. SSIS DataFlow - copy only changed and new records but no really clear on how to do so. What is the best and simple way to achieve this.

Chris Mack · Answer 1 · 2017-11-19T17:46:22.070

The link you posted is basically saying to stage everything and use a MERGE to update your table (essentially an UPDATE/INSERT).

The only way I can really think of to make your process quicker (to a significant degree) by partially selecting from table A would be to add a "last updated" timestamp to table A and enforcing that it will always be up to date.

One way to do this is with a trigger; see here for an example.

You could then select based on that timestamp, perhaps keeping a record of the last timestamp used each time you run the SSIS package, and then adding a margin of safety to that.

Edit: I just saw that you already have a modifiedOn column, so you could use that as described above.

Examples:

There are a few different ways you could do it:

ONE

Include the modifiedOn column on in your final destination table.

You can then build a dynamic query for your data flow source in a SSIS string variable, something like:

"SELECT * FROM [table A] WHERE modifiedOn >= DATEADD(DAY, -1, '" + @[User::MaxModifiedOnDate] + "')"

@[User::MaxModifiedOnDate] (string variable) would come from an Execute SQL Task, where you would write the result of the following query to it:

SELECT FORMAT(CAST(MAX(modifiedOn) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DestinationTable

The DATEADD part, as well as the CAST to a certain degree, represent your margin of safety.

TWO

If this isn't an option, you could keep a data load history table that would tell you when you need to load from, e.g.:

CREATE TABLE DataLoadHistory
(
    DataLoadID int PRIMARY KEY IDENTITY
    , DataLoadStart datetime NOT NULL
    , DataLoadEnd datetime
    , Success bit NOT NULL
)

You would begin each data load with this (Execute SQL Task):

CREATE PROCEDURE BeginDataLoad
    @DataLoadID int OUTPUT
AS

INSERT INTO DataLoadHistory
(
    DataLoadStart
    , Success
)

VALUES
(
    GETDATE()
    , 0
)

SELECT @DataLoadID = SCOPE_IDENTITY()

You would store the returned DataLoadID in a SSIS integer variable, and use it when the data load is complete as follows:

CREATE PROCEDURE DataLoadComplete
    @DataLoadID int
AS

UPDATE DataLoadHistory
SET
    DataLoadEnd = GETDATE()
    , Success = 1
WHERE DataLoadID = @DataLoadID

When it comes to building your query for table A, you would do it the same way as before (with the dynamically generated SQL query), except MaxModifiedOnDate would come from the following query:

SELECT FORMAT(CAST(MAX(DataLoadStart) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DataLoadHistory WHERE Success = 1

So the DataLoadHistory table, rather than your destination table.

Note that this would fail on the first run, as there'd be no successful entries on the history table, so you'd need you insert a dummy record, or find some other way around it.

THREE

I've seen it done a lot where, say your data load is running every day, you would just stage the last 7 days, or something like that, some margin of safety that you're pretty sure will never be passed (because the process is being monitored for failures).

It's not my preferred option, but it is simple, and can work if you're confident in how well the process is being monitored.

I don’t think I need trigger, I just can insert value in columns createdOn and modifiedOn from table A to temp table, but in the end this temp table will be empty for every successful ssis job. So its quite difficult to compare the record between table A and temp table. If u don’t mind, pls post some example query on how can I select based on the timestamp. For my case, the temp table will be empty for every successful ssis job. — xChaax, Nov 19 '17 at 03:46
I've updated my answer with some examples. The trigger part you can disregard, as you already have the `modifiedOn` column available. — Chris Mack, Nov 19 '17 at 17:31

SSIS data flow - copy new data or update existing

1 Answers1