1

I haven't been able to find documentation/an explanation on how you would reload incremental data using Change Data Capture (CDC) in SQL Server 2014 with SSIS.

Basically, on a given day, if your SSIS incremental processing fails and you need to start again. How do you stage the recently changed records again?

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
tember
  • 1,418
  • 13
  • 32

4 Answers4

2

I suppose it depends on what you're doing with the data, eh? :) In the general case, though, you can break it down into three cases:

  1. Insert - check if the row is there. If it is, skip it. If not, insert it.
  2. Delete - assuming that you don't reuse primary keys, just run the delete again. It will either find a row to delete or it won't, but the net result is that the row with that PK won't exist after the delete.
  3. Update - kind of like the delete scenario. If you reprocess an update, it's not really a big deal (assuming that your CDC process is the only thing keeping things up to date at the destination and there's no danger of overwriting someone/something else's changes).
Ben Thul
  • 31,080
  • 4
  • 45
  • 68
  • But the first step in the ETL - grabbing the changed data from the source. How do you 'reset' it so it stages those recent changes again. Once it is in the staging database it is easy to deal with. But the CDC - black box part of the process is stumping me. I would like to re-stage the same data if required (manually is fine). – tember Feb 24 '15 at 18:28
  • 1
    Any pull of CDC data is predicated on you providing a first_lsn and last_lsn. Presumably, in the case of failure, you'd just provide the same endpoints again. – Ben Thul Feb 24 '15 at 18:30
  • how would you do that? I guess that breaks down in two questions: How do you read the lsn value? How would you tell SSIS to use a custom lsn range? – tember Feb 24 '15 at 18:36
  • Ah… you ceded control to a black box (i.e. the CDC Control Task). I don't know how it keeps track of LSN ranges (which is why I chose not to use it the last time I had to deal with CDC). Looks like a good opportunity for you to learn though, eh? :) – Ben Thul Feb 24 '15 at 18:46
  • ha ha - yes, I guess I will have to learn :P Can you tell me how to read the LSN values? – tember Feb 24 '15 at 18:51
  • Found it - the table [cdc].[lsn_time_mapping] defines the lsn ranges. I thought it was hidden the actual lsn value. I think I can figure LSN values now. But I would still like to know how the records get 'marked' as processed. – tember Feb 24 '15 at 18:58
  • 1
    They don't. I suggest that you read https://msdn.microsoft.com/en-us/library/hh758674.aspx (specifically, the section entitled "Handling CDC State Persistency"). Also, I'd suggest forgetting about LSN-to-time mapping. That's merely a convenience and you risk missing transactions. CDC provides functions for dealing with LSN values directly to ensure that you have gapless processing. Coincidentally, I wrote an answer to another question about this yesterday. Check that out, too: http://stackoverflow.com/questions/28675210/sql-cdc-data-processing-options/28680689#28680689 – Ben Thul Feb 24 '15 at 19:17
  • Can I mark both answers as accepted? Neither answers my question completely - but both helped me understand what I am dealing with. – tember Feb 24 '15 at 22:20
  • You don't have to choose either. But to answer your question, you can only mark one as the answer. – Ben Thul Feb 25 '15 at 03:28
2

Assuming you are using the new CDC SSIS 2012 components, specifically the CDC Control Task at the beginning and end of the package. Then if the package fails for any reason before it runs the CDC Control Task at the end of the package those LSNs (Log Sequence Number) will NOT be marked as processed so you can just restart the SSIS package from the top after fixing the issue and it will just reprocess those records again. You MUST use the CDC Control Task to make this work though or keep track the LSNs yourself (before SSIS 2012 this was the only way to do it).

Matt Masson (Sr. Program Manager on MSFT SQL Server team) has a great post on this with a step-by-step walkthrough: CDC in SSIS for SQL Server 2012

Also, see Bradley Schacht's post: Understanding the CDC state Value

billinkc
  • 59,250
  • 9
  • 102
  • 159
Steve Wake
  • 21
  • 1
  • I do have it setup with CDC Control Tasks in SSIS 2012 and I have working for the most part. I guess I am asking is how do you "unmark" those records? – tember Feb 24 '15 at 18:40
  • 1
    Also, for a breakdown of the CDC States used by the CDC Control Task and how you can modify it to reprocess CDC records, see Bradley Schacht's post: http://www.bradleyschacht.com/understanding-the-cdc-state-value/ – Steve Wake Feb 24 '15 at 18:44
1

So I did figure out how to do this in SSIS.

I record the min and max LSN number everytime my SSIS package runs in a table in my data warehouse.

If I want to reload a set of data from the CDC source to staging, in the SSIS package I need to use the CDC Control Task and set it to "Mark CDC Start" and in the text box labelled "SQL Server LSN to start...." I put the LSN value I want to use as a starting point.

I haven't figured out how to set the end point, but I can go into my staging table and delete any data with an LSN value > then my endpoint.

You can only do this for CDC changes that have not been 'cleaned up' - so only for data that has been changed within the last 3 days.

As a side point, I also bring across the lsn_time_mapping table to my data warehouse since I find this information historically useful and it gets 'cleaned up' every 4 days in the source database.

tember
  • 1,418
  • 13
  • 32
  • the default clean-up job retains the cdc data for 3 days, but you can change the default value to whatever you want(7 days) – Siva Aug 27 '19 at 23:43
0

To reload the same changes you can use the following methods.

Method #1: Store the TFEND marker from the [cdc_states] table in another table or variable. Reload back the marker to your [cdc_states] from the "saved" value to process the same range again. This method, however, allows you to start processing from the same LSN but if in the meanwhile you change table got more changes those changes will be captured as well. So, you can potentially get more changes that happened after you did the first data capture.

Method #2: In order to capture the specified range, record the TFEND markers before and after the range is processed. Now, you can use the OLEDB Source Connection (SSIS) with the following cdc functions. Then use the CDC Splitter as usual to direct Inserts, Updates, and Deletes.

DECLARE @start_lsn binary(10);
DECLARE @end_lsn binary(10);
SET @start_lsn = 0x004EE38E921A01000001;-- TFEND (1) -- if null then sys.fn_cdc_get_min_lsn('YourCapture') to start from the beginnig of _CT table
SET @end_lsn = 0x004EE38EE3BB01000001;  -- TFEND (2)
    SELECT * FROM [cdc].[fn_cdc_get_net_changes_YOURTABLECAPTURE](
     @start_lsn
    ,@end_lsn
    ,N'all' -- { all | all with mask | all with merge }
    --,N'all with mask' -- shows values in "__$update_mask" column
    --,N'all with merge' -- merges inserts and updates together. It's meant for processing the results using T-SQL MERGE statement
    )
ORDER BY __$start_lsn;
DmitryV
  • 31
  • 3