
I'm a beginner with SQL Server. For a project I need CDC to be turned on. I copy the CDC data to another (archive) database, and after that the CDC tables can be cleaned immediately. So the retention time doesn't need to be high; I just put it at 1 minute. But when the cleanup job runs (after the retention time has already passed), it appears to delete only a few records (the oldest ones). Why doesn't it delete everything? Sometimes it doesn't delete anything at all. Only after running the job a few more times do the other records get deleted. I find this strange because the retention time has long passed.

I set the retention time to 1 minute (I actually wanted 0, but that isn't possible) and didn't change the threshold (= 5000). I disabled the schedule, since I want the cleanup job to run immediately after the CDC records have been copied to my archive database, not at some fixed time.
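For completeness, this is roughly how I configured it and how I start the cleanup on demand; a sketch, where MyDatabase stands in for my actual database name (the cleanup job is named cdc.<database>_cleanup by default):

-- Run in the CDC-enabled database: retention is in minutes,
-- threshold stays at the default of 5000.
EXEC sys.sp_cdc_change_job
    @job_type  = N'cleanup',
    @retention = 1,
    @threshold = 5000;

-- After the copy-to-archive task finishes, start the cleanup job on demand
-- instead of on a schedule ('MyDatabase' is a placeholder).
EXEC msdb.dbo.sp_start_job @job_name = N'cdc.MyDatabase_cleanup';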

My reasoning was this: say there are updates in the afternoon. The task that copies the CDC records to the archive database runs at 2:00 AM, and right after that task the cleanup job gets called. Because of the minimal retention time, all the CDC records should then be removed by the cleanup job; the retention time has passed, after all?

I also tried setting up a schedule in the job again, the way CDC is generally meant to be used. After the scheduled time had passed I checked the CDC table, and it turned out it again only deleted the oldest record. So what am I doing wrong?

As a workaround I made a new job whose task is to delete all records in the CDC tables (and I disabled the default CDC cleanup job entirely). This works better, since it removes everything, but it bothers me because I want to work with the original cleanup job, and I think it should be able to do what I want.
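The workaround job basically just empties the change tables directly; a minimal sketch, assuming a single capture instance called dbo_MyTable (a placeholder), and I'm aware that touching the cdc schema tables directly isn't the intended approach:

-- Workaround job step: wipe the change table for each capture instance.
-- 'dbo_MyTable' is a placeholder; repeat for every capture instance.
DELETE FROM cdc.dbo_MyTable_CT;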

Thanks,

Kim

  • A typical CDC workflow is "process any CDC data that I haven't processed previously" which entails keeping track of the last LSN that you processed for each table. You'd also set the retention well in excess of your processing cadence (i.e. if you process once a day you'd set retention to 3 days or something) so that if something goes wrong you have time to react before the data gets purged. And then you just let the system take care of it. That said, I'm not surprised that there's weird behavior at the edge case of 1 minute. What happens if you set it to something like 1 hour and schedule it? – Ben Thul Aug 13 '14 at 15:23
  • Thx! I set the retention time to 1 hour and scheduled it. I changed 2 records and, after a few minutes, another record. When the schedule + retention time was due, I saw it only deleted 1 record out of the 2 that I changed at first. I set another schedule again and then it deleted the 2nd record out of those 2. There's still another record left. So it still deleted one record at a time. Should it be even longer than one hour? The thing is that I want to delete the CDC records the same night. The SSIS task should copy all the records at midnight or so, and before the next morning I want to purge. – Kim Aug 14 '14 at 08:32
  • P.S. Can you explain (maybe small example) what you mean with your first sentence. I'm not sure if I understood that correctly. "A typical CDC workflow is "process any CDC data that I haven't processed previously" which entails keeping track of the last LSN that you processed for each table." – Kim Aug 14 '14 at 08:33
  • You shouldn't be accessing the capture tables directly, but rather getting the CDC data through functions that are created when you set up cdc. Specifically, cdc.fn_cdc_get_all_changes_ or cdc.fn_cdc_get_net_changes_. By managing the LSN ranges that you're processing with each run you can be sure that each row is processed once and only once. With your current scheme, you risk missing some (if changes happen between when you start the ETL and when you delete the CDC data). When you use the functions, you don't have to worry about what's getting deleted when. – Ben Thul Aug 14 '14 at 17:24

1 Answer


Rather than worrying about what's in the table, I'd use the helper functions that are created for each capture instance. Specifically, cdc.fn_cdc_get_all_changes_ and cdc.fn_cdc_get_net_changes_. A typical workflow that I've used with these goes something like the following (do this for each of the capture instances). First, you'll need a table to keep track of processing status. I use something like:

-- One row per processing run; the filtered unique index allows at most
-- one unprocessed run per capture instance at a time.
create table dbo.ProcessingStatus (
   CaptureInstance sysname,
   LSN binary(10),   -- the CDC LSN functions work with binary(10)
   IsProcessed bit
);

create unique index [UQ_ProcessingStatus] 
   on dbo.ProcessingStatus (CaptureInstance) 
   where IsProcessed = 0;
  1. Get the current max log sequence number (LSN) using fn_cdc_get_max_lsn.
  2. Get the last processed LSN and increment it using fn_cdc_increment_lsn. If you don't have one (i.e. this is the first time you've processed), use fn_cdc_get_min_lsn for this capture instance instead (but don't increment it!). Record whatever LSN you're using in the table, with IsProcessed = 0.
  3. Select from whichever of the cdc.fn_cdc_get… functions makes sense for your scenario and process the results however you're going to process them.
  4. Update IsProcessed = 1 for this run (see the sketch below).
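Putting the steps together, a minimal sketch for one capture instance might look like this (dbo_MyTable and the N'all' row filter are placeholders; substitute your own capture instance and whichever get-changes function and filter fit your scenario):

DECLARE @capture_instance sysname = N'dbo_MyTable';   -- placeholder
DECLARE @from_lsn binary(10), @to_lsn binary(10);

-- Step 1: the current maximum LSN in the log.
SET @to_lsn = sys.fn_cdc_get_max_lsn();

-- Step 2: the last processed LSN for this instance, incremented by one.
SELECT TOP (1) @from_lsn = sys.fn_cdc_increment_lsn(ps.LSN)
FROM dbo.ProcessingStatus AS ps
WHERE ps.CaptureInstance = @capture_instance
  AND ps.IsProcessed = 1
ORDER BY ps.LSN DESC;

-- First run: fall back to the minimum LSN for the instance (not incremented).
IF @from_lsn IS NULL
    SET @from_lsn = sys.fn_cdc_get_min_lsn(@capture_instance);

-- Record the upper bound of this run with IsProcessed = 0.
INSERT dbo.ProcessingStatus (CaptureInstance, LSN, IsProcessed)
VALUES (@capture_instance, @to_lsn, 0);

-- Step 3: process the changes in the @from_lsn..@to_lsn range,
-- e.g. copy them into the archive database.
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, N'all');

-- Step 4: mark this run as processed.
UPDATE dbo.ProcessingStatus
SET IsProcessed = 1
WHERE CaptureInstance = @capture_instance
  AND LSN = @to_lsn;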

As for monitoring your original issue, just make sure that the data in the capture table is generally within the retention period. That is, if you set it to 2 days, I wouldn't even think about it being a problem until it got to be over 4 days (assuming that your call to the cleanup job is scheduled at something like every hour). And when you process with the above scheme, you don't need to worry about "too much" data being there; you're always processing a specific interval rather than "everything".

Ben Thul
  • Thanks for your explanation, I understand CDC better now. I experimented a bit with all of it and I used it with CDC control task in SSIS and it seems to be working perfectly now. It processes the CDC tables only once now, it's solved :) – Kim Aug 15 '14 at 13:20