The link you posted is basically saying to stage everything and use a MERGE
to update your table (essentially an UPDATE
/INSERT
).
The only way I can really think of to make your process quicker (to a significant degree) by partially selecting from table A
would be to add a "last updated" timestamp to table A
and enforcing that it will always be up to date.
One way to do this is with a trigger; see here for an example.
You could then select based on that timestamp, perhaps keeping a record of the last timestamp used each time you run the SSIS package, and then adding a margin of safety to that.
Edit: I just saw that you already have a modifiedOn
column, so you could use that as described above.
Examples:
There are a few different ways you could do it:
ONE
Include the modifiedOn
column on in your final destination table.
You can then build a dynamic query for your data flow source in a SSIS string variable, something like:
"SELECT * FROM [table A] WHERE modifiedOn >= DATEADD(DAY, -1, '" + @[User::MaxModifiedOnDate] + "')"
@[User::MaxModifiedOnDate]
(string variable) would come from an Execute SQL Task, where you would write the result of the following query to it:
SELECT FORMAT(CAST(MAX(modifiedOn) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DestinationTable
The DATEADD
part, as well as the CAST
to a certain degree, represent your margin of safety.
TWO
If this isn't an option, you could keep a data load history table that would tell you when you need to load from, e.g.:
CREATE TABLE DataLoadHistory
(
DataLoadID int PRIMARY KEY IDENTITY
, DataLoadStart datetime NOT NULL
, DataLoadEnd datetime
, Success bit NOT NULL
)
You would begin each data load with this (Execute SQL Task):
CREATE PROCEDURE BeginDataLoad
@DataLoadID int OUTPUT
AS
INSERT INTO DataLoadHistory
(
DataLoadStart
, Success
)
VALUES
(
GETDATE()
, 0
)
SELECT @DataLoadID = SCOPE_IDENTITY()
You would store the returned DataLoadID
in a SSIS integer variable, and use it when the data load is complete as follows:
CREATE PROCEDURE DataLoadComplete
@DataLoadID int
AS
UPDATE DataLoadHistory
SET
DataLoadEnd = GETDATE()
, Success = 1
WHERE DataLoadID = @DataLoadID
When it comes to building your query for table A
, you would do it the same way as before (with the dynamically generated SQL query), except MaxModifiedOnDate
would come from the following query:
SELECT FORMAT(CAST(MAX(DataLoadStart) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DataLoadHistory WHERE Success = 1
So the DataLoadHistory
table, rather than your destination table.
Note that this would fail on the first run, as there'd be no successful entries on the history table, so you'd need you insert a dummy record, or find some other way around it.
THREE
I've seen it done a lot where, say your data load is running every day, you would just stage the last 7 days, or something like that, some margin of safety that you're pretty sure will never be passed (because the process is being monitored for failures).
It's not my preferred option, but it is simple, and can work if you're confident in how well the process is being monitored.