I have a Spark Streaming application that consumes data from topic1 and parses it, then publishes the same records to two sinks: one to Kafka topic2 and the other to a Hive table. While publishing data to topic2 I see duplicates, but I don't see duplicates in the Hive table.

Using Spark 2.2 and Kafka 0.10.0.

KafkaWriter.write(spark, storeSalesStreamingFinalDF, config)
writeToHIVE(spark, storeSalesStreamingFinalDF, config)
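
For context, here is a minimal sketch of how storeSalesStreamingFinalDF might be produced from topic1. The schema, the object name, and the batch-style Kafka read are assumptions for illustration, not the original code:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import com.typesafe.config.Config

object Topic1Reader {

  // Hypothetical schema for the JSON records on topic1.
  val salesSchema: StructType = StructType(Seq(
    StructField("store_id", StringType),
    StructField("sale_amount", StringType)
  ))

  def read(spark: SparkSession, config: Config): DataFrame = {
    spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", config.getString("kafka.dev.bootstrap.servers"))
      .option("subscribe", "topic1")
      .load()
      // Kafka delivers the payload as binary; cast it to a string
      // and parse the JSON back into typed columns.
      .select(from_json(col("value").cast("string"), salesSchema) as "record")
      .select("record.*")
  }
}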


import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{struct, to_json}
import com.typesafe.config.Config

object KafkaWriter {

  def write(spark: SparkSession, df: DataFrame, config: Config): Unit = {
    import spark.implicits._

    // Serialize every row to a single JSON string in the `value` column,
    // which is what the Kafka sink expects.
    df.select(to_json(struct("*")) as 'value)
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", config.getString("kafka.dev.bootstrap.servers"))
      .option("topic", config.getString("kafka.topic"))
      // Options prefixed with "kafka." are passed through to the Kafka producer.
      .option("kafka.compression.type", config.getString("kafka.compression.type"))
      .option("kafka.session.timeout.ms", config.getString("kafka.session.timeout.ms"))
      .option("kafka.request.timeout.ms", config.getString("kafka.request.timeout.ms"))
      .save()
  }
}
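
For reference, a minimal sketch of the Typesafe Config entries this writer expects. The values are placeholders; only the key names are taken from the calls above:

import com.typesafe.config.{Config, ConfigFactory}

// Placeholder values; only the key names come from KafkaWriter.
val config: Config = ConfigFactory.parseString(
  """
    |kafka.dev.bootstrap.servers = "broker1:9092,broker2:9092"
    |kafka.topic = "topic2"
    |kafka.compression.type = "snappy"
    |kafka.session.timeout.ms = "30000"
    |kafka.request.timeout.ms = "40000"
  """.stripMargin)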

Can someone help with this?

I expect no duplicates in Kafka topic2.

1 Answer


To handle the duplicate data, we should set .option("kafka.processing.guarantee", "exactly_once") on the Kafka writer.
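
As a sketch, the answer's suggestion slots into the body of KafkaWriter.write like any other kafka.-prefixed option (everything else below is unchanged from the question's code):

// Inside KafkaWriter.write, with the option suggested by this answer added.
df.select(to_json(struct("*")) as 'value)
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", config.getString("kafka.dev.bootstrap.servers"))
  .option("topic", config.getString("kafka.topic"))
  // Suggested by this answer to avoid duplicate records in topic2.
  .option("kafka.processing.guarantee", "exactly_once")
  .save()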
