Is there a way to 100% automate SQL Server CDC initialization in an active SQL Server database? I am trying to solve the problem of finding the from_lsn for the first CDC data capture.

Sequence of events:

  1. Enable CDC on given database/Table
  2. Copy full table to destination (Data lake)
  3. Use CDC to capture first delta (I want to avoid duplicates, without missing a transaction)

Problem:

  • How to get the from_lsn for the fn_cdc_get_all_changes_Schema_Table(from_lsn, to_lsn, '<row_filter_option>') function

Note:

  • Need to automate 100%
  • Cannot stop transactions on the table
  • Cannot miss any data and cannot afford duplicate data
  • You are going to have to take some major `Sch-M` locks in order to enable CDC. You might as well hold them for a fraction longer in order to pull `sys.fn_cdc_get_min_lsn('Schema_Table')`; alternatively, perhaps you can just use `0x0`. See also https://learn.microsoft.com/en-us/sql/relational-databases/system-functions/sys-fn-cdc-get-min-lsn-transact-sql?view=sql-server-ver16 – Charlieface Jul 20 '22 at 20:26
  • Thanks @Charlieface. I am concerned about duplicate data when I use `sys.fn_cdc_get_min_lsn('Schema_Table')`. I am thinking about sys.fn_cdc_map_time_to_lsn, looking for suggestions on `tracking_time`. Ref - https://learn.microsoft.com/en-us/sql/relational-databases/system-functions/sys-fn-cdc-map-time-to-lsn-transact-sql?view=sql-server-ver16 – Vijred Jul 20 '22 at 20:36
  • Why would you get duplicate data? The worst that could happen without a transaction is that you miss some data. But as I said, you need to modify the table, which requires locking it, so if you put it in a transaction and call `fn_cdc_get_min_lsn` at the same time you should be OK. So what exactly are you asking? And what code do you have so far? – Charlieface Jul 20 '22 at 20:38
  • More to the point: "Need to automate 100%" and "Can not stop transactions on the table" are opposing requirements. For full automation, you need to allow the process to knock off and block any other transactions from happening on the table. – Charlieface Jul 20 '22 at 20:47
  • Example problem Scenario: - 4:05:10 - Enable CDC on given database/Table - 4:05:12 - Start full Copy operation - 4:26:46 - Complete full Copy operation - 4:26:47 - Initiate first CDC Capture Now, fn_cdc_get_min_lsn picks lsn starting (4:42:10). It includes transactions starting 4:42:10 - Keeping a transaction open during full copy operation is not feasible as this takes very long. – Vijred Jul 20 '22 at 20:49
  • That's what I mean when I said "opposing requirements": it can be done, but only at a cost of locking up the table. What I would do is call `fn_cdc_get_min_lsn` *before* doing the copy, then copy the data, then retrieve `fn_cdc_get_min_lsn` again along with net changes since the first LSN using `fn_net_changes_` and just apply the net changes, so that the final copied table is up to date with the second LSN – Charlieface Jul 20 '22 at 20:54
  • Appreciate your time @Charlieface. Thanks for confirmation about, `we can NOT do it without locking the table`. – Vijred Jul 20 '22 at 21:01
  • As I said, I also wouldn't lock the table. It's not necessary, just get the net changes from the first LSN to the current one and apply them to the copy – Charlieface Jul 20 '22 at 21:02

1 Answer

Before doing the initial load, get the value of fn_cdc_get_max_lsn() and store it. This function returns the highest LSN known to CDC across all capture instances. It's the high water mark for the whole database.
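
For example, a minimal T-SQL sketch (the checkpoint table etl_checkpoint is a hypothetical name; Schema_Table is the capture instance from the question):

-- Persist the database-wide high water mark before starting the full copy.
create table etl_checkpoint (capture_instance sysname primary key, last_lsn binary(10) not null);

declare @start_lsn binary(10) = sys.fn_cdc_get_max_lsn();

insert etl_checkpoint (capture_instance, last_lsn) values (N'Schema_Table', @start_lsn);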

Copy the whole table.

Start your delta process. The first time you call the delta function, the value of the min_lsn argument will be the value you stored earlier from fn_cdc_get_max_lsn(), incremented by fn_cdc_increment_lsn(). Get the current value from fn_cdc_get_max_lsn() (not the stored one) and use it as the value of the max_lsn argument.
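
A sketch of that first call, reusing the hypothetical etl_checkpoint table from the sketch above:

-- from_lsn: the stored high water mark plus one increment; to_lsn: the current high water mark.
declare @from_lsn binary(10), @to_lsn binary(10);

select @from_lsn = sys.fn_cdc_increment_lsn(last_lsn) from etl_checkpoint where capture_instance = N'Schema_Table';
set @to_lsn = sys.fn_cdc_get_max_lsn();

select * from cdc.fn_cdc_get_all_changes_Schema_Table(@from_lsn, @to_lsn, N'all');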

From here proceed as you expect. Take the maximum LSN returned from the delta function and store it. Next time you pull a delta, use fn_cdc_increment_lsn() on the stored value, use the result as the value of the min_lsn argument, and use the result of fn_cdc_get_max_lsn() as the max_lsn argument.

With this process you will never miss any data. (Not covered here: be sure to check that your boundary conditions fall within a valid LSN range.)
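
For example, two checks worth making before calling the change function (reusing @from_lsn and @to_lsn from the sketch above):

-- The from_lsn must not predate the capture instance's minimum available LSN
-- (i.e. the range must not have been trimmed by the CDC cleanup job).
if @from_lsn < sys.fn_cdc_get_min_lsn(N'Schema_Table')
    raiserror(N'Requested LSN range has been cleaned up; a full reload is required.', 16, 1);

-- If nothing has been captured since the last checkpoint, the incremented
-- from_lsn will be past the current high water mark; skip this cycle.
if @from_lsn > @to_lsn
    print N'No new changes since the last checkpoint.';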

Now, you mentioned that you want to avoid "duplicates". But if you try to define what a "duplicate" is in this scenario, I think you'll find it difficult.

For example, suppose I have this table to begin with:

create table t(i int primary key, c char);
insert t(i, c) values (1, 'a');
  1. I call fn_cdc_get_max_lsn() and get 0x01.
  2. A user inserts a new row into the table: insert t(i, c) values (2, 'b');
  3. The user operation is associated with an LSN value of 0x02.
  4. I select all the rows in this table (getting two rows).
  5. I write both rows to my destination table.
  6. I start my delta process. My min_lsn argument will be 0x02.

I will therefore get the {2, 'b'} row in the delta.

But I already retrieved the row {2, 'b'} as part of my initial load. Is this a "duplicate"? No, this represents a change to the table. What will I do with this delta when I load it into my destination? There are really only two options.

Option 1: I am going to merge the delta into the destination table based on the primary key. In that case, when I merge the delta I will overwrite the already-loaded row {2, 'b'} with the new row {2, 'b'}, the outcome of which looks the same as not doing anything.
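
A sketch of what that merge might look like (dest_t and delta_t are hypothetical names for the destination table and the staged delta; it assumes one row per key, e.g. from a net-changes pull, with CDC's __$operation column preserved):

merge dest_t as d
using delta_t as s on d.i = s.i
when matched and s.__$operation = 1 then delete                 -- row was deleted at the source
when matched then update set d.c = s.c                          -- row was inserted or updated at the source
when not matched and s.__$operation <> 1 then insert (i, c) values (s.i, s.c);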

Option 2: I am going to append all changes to the destination. In that case my destination table will contain the row {2, 'b'} twice. Is this a duplicate? No, because the two rows represent how the data looked at different logical times: first when I did the initial load, and then when I did the delta.

If you try to argue that this is in fact a duplicate, then I counter by giving you this hypothetical scenario:

  1. You do the initial load, receiving row {1, 'a'}.
  2. No users change any data.
  3. You get your first delta, which is empty.
  4. A user executes update T set c = 'b' where i = 1.
  5. You get your second delta, which will include the row {1, 'b'}.
  6. A user executes update T set c = 'a' where i = 1.
  7. You get your third delta, which will include the row {1, 'a'}.

Question: Is the row you retrieved during your third delta a "duplicate"? It has the same values as a row we already retrieved previously.

If your answer is "yes", then you can never eliminate "duplicate" reads, because a "duplicate" will occur any time a row mutates to have the same values it had at some previous point in time, which is something over which you have no control. If this is a "duplicate" that you need to eliminate in the append scenario, then that elimination must be performed at the destination, by comparing the incoming values with the existing values.
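
In the append scenario, that destination-side elimination could look like this sketch (dest_history and delta_t are hypothetical names):

-- Only append delta rows whose values are not already present for that key.
insert dest_history (i, c)
select s.i, s.c
from delta_t as s
where not exists (select 1 from dest_history as d where d.i = s.i and d.c = s.c);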

  • Good, this is the long form of what I was arguing above. Normally duplicates are defined by looking at the primary key: if the primary key matches then merge. But I would ignore any changes in the first batch as representing actual *changes* to the row and just merge them into the base, because they are not transactionally consistent (eg you may have an earlier LSN for one row and a later LSN for the next in the first copy). You also need to enable CDC before copying rather than after in order to get the copied table perfect. – Charlieface Jul 21 '22 at 09:38
  • @Charlieface I suspect the OP isn't merging at the destination because in their question they indicated the use of `get_all_changes` rather than `get_net_changes` for the delta, but yeah, the initial load is going to just come from the base table and is not a delta, but all subsequent loads are deltas, so there's a difference in semantics between the first and subsequent batches that makes the idea of a "duplicate" a bit strange. – allmhuran Jul 21 '22 at 10:01
  • Thanks @allmhuran. I was trying to avoid duplicate inserts (first example in your explanation). Looks like there is no way to avoid that, so I have decided to process duplicates using the primary key (Option 1 listed above). Note - CDC capture can lag by 15+ seconds at times, so fn_cdc_get_max_lsn() will be inaccurate until data is inserted into the CDC tables. Decided to store a checkpoint datetime and extract the LSN from it. FYI: Tested a few scenarios for reference @ https://github.com/vijred/MSSQL/tree/master/CDC/CDC_IntegrationTest – Vijred Jul 21 '22 at 21:16
  • @Vijred If you're going to use a `merge` at the destination you *probably* want `get_net_changes` (rather than `get_all_changes`), but I don't know the specifics of your scenario. You don't have to worry about `max_lsn` "inaccuracies", because for your "next low water mark" you're always going to use the *actual* high water mark returned from the change function. That is to say, even if the max lsn was reported inaccurately it wouldn't matter, since you're only using it as an "indicative" upper bound. – allmhuran Jul 21 '22 at 21:46