
We have a requirement to set up a Kafka Connect MicrosoftSqlServerSource connector to capture all transactions (inserts/updates) performed on one of the sales tables in an Azure SQL database.

To support this source connector, we first enabled CDC at both the database and the table level. We also created a view over the source table, which serves as the input for the source connector (table.types = VIEW in the connector configuration). Once the setup was complete at both the connector and the database level, we could see messages flowing to the topic (created automatically along with the connector) whenever new inserts/updates happened on the table.
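For reference, here is a minimal sketch of the database-side setup described above. The underlying table name (dbo.SalesAndRefundItemStatus) is hypothetical, since the real name is masked; the view and column names come from the connector configuration below.

```sql
-- Enable CDC at the database level (requires db_owner).
EXEC sys.sp_cdc_enable_db;
GO

-- Enable CDC on the source table (table name is hypothetical here).
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'SalesAndRefundItemStatus',
    @role_name     = NULL;  -- no gating role
GO

-- View over the source table; table.whitelist in the connector points at this.
CREATE VIEW dbo.item_status_view AS
SELECT SalesandRefundItemStatusID,  -- incrementing.column.name
       ProcessedDateTime            -- timestamp.column.name
       -- ... any other columns to publish ...
FROM dbo.SalesAndRefundItemStatus;
GO
```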

One strange behavior we observed while testing: after we stopped generating test traffic, the last message received in the topic kept getting duplicated until a new message arrived.

Could you please help us understand whether this is expected behavior, or whether we missed a configuration that resulted in these duplicate entries? Please guide us on how to tackle this duplicate issue.

Attaching a snapshot of the setup below.

Connector Summary


Connector Class = MicrosoftSqlServerSource
Max Tasks = 1
kafka.auth.mode = SERVICE_ACCOUNT
kafka.service.account.id = **********
topic.prefix = ***********
connection.host = **************8
connection.port = 1433
connection.user = ***************
db.name = **************88
table.whitelist = item_status_view
timestamp.column.name = ProcessedDateTime
incrementing.column.name = SalesandRefundItemStatusID
table.types = VIEW
schema.pattern = dbo
db.timezone = Europe/London
mode = timestamp+incrementing
timestamp.initial = -1
poll.interval.ms = 10000
batch.max.rows = 1000
timestamp.delay.interval.ms = 30000
output.data.format = JSON

1 Answer


What you're describing is controlled by

mode = timestamp+incrementing
poll.interval.ms = 10000

The connector should save the last timestamp, then query only for timestamps greater than the last one it saw... If you are seeing greater-than-or-equal-to behavior, then that is certainly a bug that should be reported.
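For context, here is a simplified sketch of the query shape the JDBC source connector issues on each poll in timestamp+incrementing mode. This is an approximation based on the open source kafka-connect-jdbc connector; the exact SQL and parameter order may vary by version.

```sql
-- Each poll fetches rows newer than the stored (timestamp, id) offset,
-- bounded above by "now" minus timestamp.delay.interval.ms.
SELECT *
FROM dbo.item_status_view
WHERE ProcessedDateTime < ?        -- upper bound: now - delay interval
  AND (
        (ProcessedDateTime = ? AND SalesandRefundItemStatusID > ?)  -- same timestamp, larger ID
        OR ProcessedDateTime > ?                                    -- strictly newer timestamp
      )
ORDER BY ProcessedDateTime, SalesandRefundItemStatusID;
```

Rows should only be re-read if the stored timestamp compares as equal or earlier, which is where the timestamp column's precision comes into play.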

Or you should read the docs:

A timestamp column must use datetime2 and not datetime. If the timestamp column uses datetime, the topic may receive numerous duplicates
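If the underlying column is datetime, one workaround (a sketch, not an official recommendation) is to cast it to datetime2 inside the view the connector reads from, so the comparisons above happen at datetime2 precision. The underlying table name here is hypothetical, as in the earlier sketch.

```sql
-- Re-create the view, exposing the timestamp column as datetime2.
ALTER VIEW dbo.item_status_view AS
SELECT SalesandRefundItemStatusID,
       CAST(ProcessedDateTime AS datetime2) AS ProcessedDateTime
       -- ... any other columns the topic needs ...
FROM dbo.SalesAndRefundItemStatus;  -- hypothetical table name
GO
```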

As an alternative, you could use Debezium (running your own connector rather than the Confluent Cloud offering) to truly stream all table operations.

– OneCricketeer