We are trying to write a Spark Streaming application that writes to HDFS, but whenever we write the files a lot of duplicates show up. This happens whether or not we kill the application, and with both the DStream and Structured Streaming APIs. The source is a Kafka topic. The behaviour of the checkpoint directory seems random, and I have not found much relevant information on the issue.
Question: can the checkpoint directory provide exactly-once behavior?
scala version: 2.11.8
spark version: 2.3.1.3.0.1.0-187
kafka version: 2.11-1.1.0
zookeeper version: 3.4.8-1
HDP: 3.1
Any help is appreciated. Thanks, Gautam
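For reference, this is a minimal sketch of the write path we are aiming for, with a checkpoint location set on the file sink so that offsets and committed batches survive a restart. The broker, topic, and paths below are placeholders, not our real configuration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("checkpointSketch").getOrCreate()

    // Read the raw Kafka records and keep only the value as text
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker
      .option("subscribe", "some_topic")                   // placeholder topic
      .option("startingOffsets", "latest")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    // The file sink needs a stable checkpointLocation to track which batches
    // have already been committed across restarts
    raw.writeStream
      .outputMode("append")
      .format("orc")
      .option("path", "/tmp/sketch/data")                       // placeholder output dir
      .option("checkpointLocation", "/tmp/sketch/checkpoint")   // placeholder checkpoint dir
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()
      .awaitTermination()
  }
}

Our actual application code is below: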
import org.apache.spark.sql.SparkSession

object sparkStructuredDownloading {
  val kafka_brokers = "kfk01.*.com:9092,kfk02.*.com:9092,kfk03.*.com:9092"

  def main(args: Array[String]): Unit = {
    val topic = args(0).trim
    new downloadingAnalysis(kafka_brokers, topic).process()
  }
}

class downloadingAnalysis(brokers: String, topic: String) {

  def process(): Unit = {
    // try {
    val spark = SparkSession.builder()
      .appName("sparkStructuredDownloading")
      // .appName("kafka_duplicate")
      .getOrCreate()
    spark.conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
    println("Application Started")

    import spark.implicits._
    import scala.concurrent.duration._
    import org.apache.spark.sql.streaming.{OutputMode, Trigger}

    val inputDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("subscribe", topic)
      .option("startingOffsets", "latest")
      //.option("kafka.group.id", "testduplicate")
      .load()

    // Convert the binary Kafka value to text
    val personJsonDf = inputDf.selectExpr("CAST(value AS STRING)")
    println("READ STREAM INITIATED")

    // Keep only the lines that pass validation
    val filteredDF = personJsonDf.filter(line => new ParseLogs().validateLogLine(line.get(0).toString()))

    // NOTE: the body of this UDF was truncated in the original post; it is meant
    // to parse a raw log line. The identity body below is only a placeholder.
    spark.sqlContext.udf.register("parseLogLine", (logLine: String) => {
      logLine
    })

    val df1 = filteredDF.selectExpr("parseLogLine(value) as result")
    println(df1.schema)
    println("WRITE STREAM INITIATED")

    val checkpoint_loc = "/warehouse/test_duplicate/download/checkpoint1"

    val kafkaOutput = df1.writeStream
      .outputMode("append")
      .format("orc")
      .option("path", "/warehouse/test_duplicate/download/data1")
      .option("checkpointLocation", checkpoint_loc)
      .option("maxRecordsPerFile", 10)
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()

    kafkaOutput.awaitTermination()
  }
}
Comment from nullmari: see https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication
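Following up on that link, a minimal sketch of the deduplication step described in the guide, applied to an already-parsed streaming DataFrame. The eventId and eventTime column names are placeholders for whatever unique key and event-time timestamp the parsed records actually carry.

import org.apache.spark.sql.DataFrame

// Drop records whose key has already been seen within the watermark window.
// "eventTime" must be a timestamp column for the watermark to apply.
def dropSeenRecords(parsed: DataFrame): DataFrame =
  parsed
    .withWatermark("eventTime", "10 minutes")   // bound how long duplicate-tracking state is kept
    .dropDuplicates("eventId", "eventTime")     // discard duplicates arriving within the watermark

This removes duplicates produced upstream (e.g. records replayed into Kafka), but it relies on the data having a stable unique key; it does not by itself address files rewritten by the sink after a restart.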