If the duplicate messages came from the topic, wouldn't we have a different partition offset pair in the metadata record?
Yes, if you produced twice, the messages would have different offsets.
Exactly-once is a complex topic, and implementing exactly-once consumption requires a process specific to the destination. This blog covers the two failure modes which need to be handled for exactly-once to be implemented successfully.
Specifically:
- A - Write to destination fails. In this case SnowflakeSink, the Kafka connector, needs to inform Kafka Connect of the failure to write to the destination. That is more complex than it seems.
- B - Commit to Kafka fails. In this case SnowflakeSink is given a record it has already processed. So it needs to roll back the transaction so the row isn't inserted on the Snowflake side, or, if say auto-commit was enabled, it needs to check the destination to ensure the record doesn't already exist.
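To make the two failure modes concrete, here is a minimal sketch of a sink task's processing loop, not the real connector code. The class, `insert_row`, and the record shape are all illustrative assumptions; in mode A the task re-raises so the framework never commits the offset, and in mode B redelivered records are skipped against a processed-offset watermark.

```python
class SketchSink:
    """Illustrative sink task showing failure modes A and B (not the real connector)."""

    def __init__(self):
        # Highest offset known to be durably written to the destination.
        self.processed_offset = -1

    def put(self, records):
        for record in records:
            # Failure mode B: the framework redelivered a record whose
            # offset we already wrote. Skip it instead of inserting twice.
            if record["offset"] <= self.processed_offset:
                continue
            # Failure mode A: if the destination write fails, the exception
            # propagates, so the framework does NOT commit and will retry.
            self.insert_row(record)
            self.processed_offset = record["offset"]

    def insert_row(self, record):
        # Stand-in for the real destination write (e.g. a Snowflake insert).
        if record.get("fail"):
            raise RuntimeError("destination write failed")
```

For example, redelivering offset 1 after a failed commit leaves only one copy of that row, while a failed write leaves the watermark untouched so the batch is retried.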
I have only done a cursory review of the connector, but based on this comment I think A is handled in the sink.
It could be handled elsewhere, but to handle B, I would have expected the processedOffset instance variable to be populated on start with the highest offset found in the destination.
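What I'd expect on start is something like the sketch below, assuming a hypothetical `query_max_offset` callable that runs a max-offset query against the destination table; this is not the connector's actual API.

```python
def recover_processed_offset(query_max_offset, topic, partition):
    """On task start, seed the watermark from the destination.

    `query_max_offset` is a stand-in for e.g. a
    SELECT MAX(offset) ... WHERE topic = ? AND partition = ?
    against the destination table; it returns None for an empty table.
    """
    max_offset = query_max_offset(topic, partition)
    # -1 means "nothing processed yet", so the first record (offset 0) passes.
    return max_offset if max_offset is not None else -1
```

With that seed, any record at or below the recovered offset is treated as a duplicate and skipped.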
Generally, even if the guarantees exist, I think it's best to plan for duplicates. As @MikeWalton suggests, it's possible to generate duplicates on the producer side as well, and Snowflake provides robust tooling for merging tables.
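Producer-side duplicates arrive with distinct offsets, so offset-based dedupe can't catch them; merging on a business key downstream can. A rough sketch of the idea (in Python rather than Snowflake SQL; `dedupe_rows` and the `order_id` key are hypothetical stand-ins for a MERGE on that key):

```python
def dedupe_rows(rows, key):
    """Keep the last row seen per business key.

    Roughly what a destination-side merge keyed on `key` achieves:
    duplicates with different Kafka offsets still collapse to one row.
    """
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())
```

Note the duplicate pair below differs in offset but shares a business key, which is exactly the case offset-based checks miss.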