
Do you know if it is possible, and if so, what is the best way, to ensure exactly-once delivery to HDFS using Kafka Connect with Kafka?

I know that Kafka Connect attempts to find offsets for its consumer group in the __consumer_offsets topic, but I need an additional check, since duplicates are not acceptable. A way to inspect those committed offsets is sketched below.
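For reference, the offsets a sink connector's consumer group has committed can be inspected with the Kafka AdminClient. A minimal sketch, assuming a broker at localhost:9092; the group name follows Connect's connect-&lt;connector name&gt; convention, and hdfs-sink is a hypothetical connector name:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    import java.util.Map;
    import java.util.Properties;

    public class OffsetCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // Sink tasks commit under the group "connect-<connector name>";
                // "connect-hdfs-sink" here assumes a connector named "hdfs-sink".
                Map<TopicPartition, OffsetAndMetadata> offsets =
                        admin.listConsumerGroupOffsets("connect-hdfs-sink")
                             .partitionsToOffsetAndMetadata()
                             .get();
                offsets.forEach((tp, om) ->
                        System.out.printf("%s -> committed offset %d%n", tp, om.offset()));
            }
        }
    }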

2 Answers


The HDFS connector already claims to support exactly-once delivery by using a write-ahead log (WAL) in HDFS. When Connect is restarted, it actually checks that log (unless the logic has changed recently), not the offsets topic.
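For completeness, a minimal sink configuration sketch, assuming Confluent's HDFS connector; the topic name, HDFS URL, and flush size are illustrative values, not requirements:

    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    topics=test_hdfs
    hdfs.url=hdfs://localhost:9000
    flush.size=3

Note that the exactly-once recovery described above is handled by the connector itself; no extra configuration is needed to enable the WAL.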

OneCricketeer

When the connector writes files to HDFS, it first writes to a temp file and a WAL (for replay), and then renames the temp file to the final file. The name of that final file encodes the offsets of the records it contains. So on startup, the connector looks in HDFS, finds the latest committed offset, and resumes from there, which should guarantee exactly-once delivery. If no offset is found in HDFS, it falls back to the consumer's offset reset policy. Take a look at https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/DataWriter.java and https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java to understand more.
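To make the recovery step concrete: committed files follow the pattern &lt;topic&gt;+&lt;partition&gt;+&lt;startOffset&gt;+&lt;endOffset&gt;.&lt;extension&gt;, e.g. test_hdfs+0+0000000000+0000000002.avro. A minimal sketch of how the resume offset can be recovered from such a name; this mirrors what TopicPartitionWriter does on startup, but the parser below is illustrative, not the connector's actual code:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CommittedFileOffsets {
        // Matches names like "test_hdfs+0+0000000000+0000000002.avro":
        // topic, partition, start offset, end offset, extension.
        private static final Pattern COMMITTED =
                Pattern.compile("(.+)\\+(\\d+)\\+(\\d+)\\+(\\d+)\\.(\\w+)");

        // Returns the offset to resume consuming from (last written + 1),
        // or -1 if the name does not look like a committed data file.
        public static long nextOffset(String filename) {
            Matcher m = COMMITTED.matcher(filename);
            if (!m.matches()) {
                return -1L;
            }
            long endOffset = Long.parseLong(m.group(4));
            return endOffset + 1;
        }

        public static void main(String[] args) {
            // Prints 3: the file holds offsets 0..2, so consumption resumes at 3.
            System.out.println(nextOffset("test_hdfs+0+0000000000+0000000002.avro"));
        }
    }

Because the file rename in HDFS is atomic and the filename itself carries the end offset, the filesystem is the source of truth for what was delivered, which is why the connector can ignore __consumer_offsets on restart.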

rookie