If the duplicate messages came from the topic, wouldn't we have a different partition offset pair in the metadata record?
Yes, if you produced twice, the messages would have different offsets.
Exactly-once is a complex topic, and implementing exactly-once consumption requires a process specific to the destination. This blog covers the two failure modes which need to be handled for exactly-once to be implemented successfully.
Specifically:
- A - Write to destination fails. In this case SnowflakeSink, the Kafka connector, needs to inform Kafka Connect of the failure to write to the destination. That is more complex than it seems.
- B - Commit to Kafka fails. In this case SnowflakeSink is given a record it has already processed. So it needs to roll back the transaction so the row isn't inserted on the Snowflake side, or, if say auto-commit was enabled, it needs to check the destination to ensure the record doesn't already exist.
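To make the two failure modes concrete, here is a minimal sketch of a sink task's processing loop, not the real connector code. The class, `insert_row`, and the record shape are all illustrative assumptions; in mode A the task re-raises so the framework never commits the offset, and in mode B redelivered records are skipped against a processed-offset watermark.

```python
class SketchSink:
    """Illustrative sink task showing failure modes A and B (not the real connector)."""

    def __init__(self):
        # Highest offset known to be durably written to the destination.
        self.processed_offset = -1

    def put(self, records):
        for record in records:
            # Failure mode B: the framework redelivered a record whose
            # offset we already wrote. Skip it instead of inserting twice.
            if record["offset"] <= self.processed_offset:
                continue
            # Failure mode A: if the destination write fails, the exception
            # propagates, so the framework does NOT commit and will retry.
            self.insert_row(record)
            self.processed_offset = record["offset"]

    def insert_row(self, record):
        # Stand-in for the real destination write (e.g. a Snowflake insert).
        if record.get("fail"):
            raise RuntimeError("destination write failed")
```

For example, redelivering offset 1 after a failed commit leaves only one copy of that row, while a failed write leaves the watermark untouched so the batch is retried.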
I have only done a cursory review of the connector, but based on this comment I think A is handled in the sink.
It could be handled elsewhere, but to handle B, I would have expected the processedOffset instance variable to be populated on start with the highest offset found in the destination.
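What I'd expect on start is something like the sketch below, assuming a hypothetical `query_max_offset` callable that runs a max-offset query against the destination table; this is not the connector's actual API.

```python
def recover_processed_offset(query_max_offset, topic, partition):
    """On task start, seed the watermark from the destination.

    `query_max_offset` is a stand-in for e.g. a
    SELECT MAX(offset) ... WHERE topic = ? AND partition = ?
    against the destination table; it returns None for an empty table.
    """
    max_offset = query_max_offset(topic, partition)
    # -1 means "nothing processed yet", so the first record (offset 0) passes.
    return max_offset if max_offset is not None else -1
```

With that seed, any record at or below the recovered offset is treated as a duplicate and skipped.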
Generally, even if the guarantees exist, I think it's best to plan for duplicates. As @MikeWalton suggests, it's possible to generate duplicates on the producer side as well, and Snowflake provides robust tooling for merging tables.
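Producer-side duplicates arrive with distinct offsets, so offset-based dedupe can't catch them; merging on a business key downstream can. A rough sketch of the idea (in Python rather than Snowflake SQL; `dedupe_rows` and the `order_id` key are hypothetical stand-ins for a MERGE on that key):

```python
def dedupe_rows(rows, key):
    """Keep the last row seen per business key.

    Roughly what a destination-side merge keyed on `key` achieves:
    duplicates with different Kafka offsets still collapse to one row.
    """
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())
```

Note the duplicate pair below differs in offset but shares a business key, which is exactly the case offset-based checks miss.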