I am currently using the Confluent HDFS Sink Connector (v4.0.0) to replace Camus. We are dealing with sensitive data, so we need to keep offsets consistent during the cutover to the connector.

Cutover plan:

  1. We created an HDFS sink connector subscribed to a topic, writing to a temporary HDFS file. This creates a consumer group named connect-
  2. Stopped the connector using a DELETE request.
  3. Using the /usr/bin/kafka-consumer-groups script, I am able to set the connector consumer group's current offset for the Kafka topic partition to the desired value (i.e. the last offset Camus wrote + 1).
  4. When I restart the HDFS sink connector, it continues reading from the last committed connector offset and ignores the value I set. I am expecting the HDFS file name to be like: hdfs_kafka_topic_name+kafkapartition+Camus_offset+Camus_offset_plus_flush_size.format

Is my expectation of the Confluent connector's behavior correct?
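The file naming expected in step 4 can be sketched as follows. This is a minimal illustration of the `<topic>+<partition>+<startOffset>+<endOffset>.<format>` pattern the connector uses for committed files; the function name `build_filename`, the `flush_size` parameter, and the zero-padding width are assumptions for illustration, not connector API:

```python
# Sketch of the committed-file naming pattern:
#   <topic>+<partition>+<startOffset>+<endOffset>.<format>
# build_filename and flush_size are illustrative names, not connector API;
# the 10-digit zero padding is an assumption.
def build_filename(topic, partition, start_offset, flush_size, fmt="avro"):
    # A file covering flush_size records ends at start_offset + flush_size - 1.
    end_offset = start_offset + flush_size - 1
    return f"{topic}+{partition}+{start_offset:010d}+{end_offset:010d}.{fmt}"

print(build_filename("hdfs_kafka_topic_name", 3, 1000, 100))
# prints hdfs_kafka_topic_name+3+0000001000+0000001099.avro
```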

OneCricketeer
Rupesh More

1 Answer


When you restart this connector, it will use the offset embedded in the filename of the last file written to HDFS. It will not use the consumer group offset. It does this because it uses a write-ahead log to achieve exactly-once delivery to HDFS.
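The recovery behavior described above can be reasoned about with a small sketch: scan the committed filenames in a partition's HDFS directory, take the highest end offset, and resume at end offset + 1, ignoring the consumer group offset entirely. The regex and the `next_offset` function are illustrative, not the connector's internals:

```python
import re

# Committed files look like <topic>+<partition>+<start>+<end>.<format>.
# This pattern and next_offset are a sketch of the recovery idea only,
# not the connector's actual implementation.
COMMITTED = re.compile(r"^(?P<topic>.+)\+(?P<part>\d+)\+(?P<start>\d+)\+(?P<end>\d+)\.\w+$")

def next_offset(filenames):
    # Collect the end offsets of all committed files for this partition.
    ends = [int(m.group("end")) for f in filenames if (m := COMMITTED.match(f))]
    # Resume one past the highest committed end offset; None if no files yet.
    return max(ends) + 1 if ends else None

files = ["mytopic+0+0000000000+0000000099.avro",
         "mytopic+0+0000000100+0000000199.avro"]
print(next_offset(files))  # prints 200
```

This is why the dummy-file workaround in the comments below works: a file whose name ends in the Camus offset makes the connector resume at Camus offset + 1.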

dawsaw
    Thank you @dawsaw for your quick response, makes more sense of the connector behavior now. I added a dummy file with name hdfs_kafka_topic_name+kafkapartition+dummy_offset+***camus_offset***.format and created a new connector. It started writing new files with camus_offset+1. Thanks. :) – Rupesh More Apr 16 '18 at 14:56
  • @dawsaw "offset embedded in the file have of the last file written to hdfs" this is embedded in the WAL file or in the final AVRO files? – Ashika Umanga Umagiliya Dec 27 '19 at 03:16
  • Final one, the wal isn't committed data yet – dawsaw Jan 05 '20 at 03:02