I'm trying to develop a small Spark app (in Scala) that reads messages from Kafka (Confluent) and inserts them into a Hive table. Everything works as expected, except for one important feature: managing offsets when the application is restarted (re-submitted). That part confuses me.

An excerpt from my code:

  def main(args: Array[String]): Unit = {

    val sparkSess = SparkSession
      .builder
      .appName("Kafka_to_Hive")
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse/")
      .config("hive.metastore.uris", "thrift://localhost:9083")
      .config("hive.exec.dynamic.partition", "true")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .enableHiveSupport()
      .getOrCreate()

    sparkSess.sparkContext.setLogLevel("ERROR")

    // don't consider this code block please, it's just a part of Confluent avro message deserializing adventures
    sparkSess.udf.register("deserialize", (bytes: Array[Byte]) =>
      DeserializerWrapper.deserializer.deserialize(bytes)
    )
    

    val kafkaDataFrame = sparkSess
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", 'localhost:9092')
      .option("group.id", 'kafka-to-hive-1')
      // ------>   which Kafka options do I need to set here for starting from last right offset to ensure completenes of data and "exactly once" writing?   <--------
      .option("failOnDataLoss", (false: java.lang.Boolean))
      .option("subscribe", 'some_topic')
      .load()

    import org.apache.spark.sql.functions._
    
    // don't consider this code block please, it's just a part of Confluent avro message deserializing adventures
    val valueDataFrame = kafkaDataFrame.selectExpr("""deserialize(value) AS message""")
    val df = valueDataFrame.select(
      from_json(col("message"), sparkSchema.dataType).alias("parsed_value"))
      .select("parsed_value.*")


    df.writeStream
      .foreachBatch((batchDataFrame, batchId) => {
        batchDataFrame.createOrReplaceTempView("`some_view_name`")
        val sqlText = "SELECT * FROM `some_view_name` a where some_field='some value'"
        val batchDataFrame_view = batchDataFrame.sparkSession.sql(sqlText)
        batchDataFrame_view.write.insertInto("default.some_hive_table")
      })
      .option("checkpointLocation", "/user/some_user/tmp/checkpointLocation")
      .start()
      .awaitTermination()
  }

Questions (they are related to each other):

  1. Which Kafka options do I need to apply on readStream.format("kafka") so that every submit of the Spark app starts from the last correct offset?
  2. Do I need to manually read the 3rd line of the checkpointLocation/offsets/latest_batch file to find the last offsets to read from Kafka? I mean something like this: readStream.format("kafka").option("startingOffsets", """{"some_topic":{"2":35079,"5":34854,"4":35537,"1":35357,"3":35436,"0":35213}}""")
  3. What is the right/convenient way to read a stream from a Kafka (Confluent) topic? (I'm not considering Kafka's own offset storage.)

1 Answer


"Which Kafka options do I need to apply on readStream.format("kafka") for starting from last right offset on every submit of spark app?"

You would need to set startingOffsets=latest and clean up the checkpoint files.
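As an illustration only, here is a minimal sketch of what that source configuration could look like, reusing the `sparkSess`, broker address and topic name from the question (the val name is just illustrative):

    // Hedged sketch: start from the latest offsets on every submit.
    // This only takes effect if the checkpoint directory has been removed first,
    // because an existing checkpoint always wins over "startingOffsets".
    val freshKafkaStream = sparkSess
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "some_topic")
      .option("startingOffsets", "latest") // ignored once a checkpoint exists
      .option("failOnDataLoss", "false")
      .load()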

"Do I need to manually read 3rd line of checkpointLocation/offsets/latest_batch file to find last offsets to read from Kafka? I mean something like that: readStream.format("kafka").option("startingOffsets", """{"some_topic":{"2":35079,"5":34854,"4":35537,"1":35357,"3":35436,"0":35213}}""")"

Similar to the first question: if you set `startingOffsets` to that JSON string, you need to delete the checkpoint files first. Otherwise, the Spark application will always use the offsets stored in the checkpoint files, and they override whatever you put in the `startingOffsets` option.
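There is no Spark API that cleans a query's checkpoint for you (see the comments below), but as a hedged sketch you could delete the directory yourself with the Hadoop FileSystem API before starting the query. The path is just the example from the question:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hedged sketch: remove the old checkpoint so that "startingOffsets" is honoured.
    // Run this BEFORE starting the streaming query; adjust the path to your environment.
    val checkpointPath = new Path("/user/some_user/tmp/checkpointLocation")
    val fs = checkpointPath.getFileSystem(sparkSess.sparkContext.hadoopConfiguration)
    if (fs.exists(checkpointPath)) {
      fs.delete(checkpointPath, true) // recursive delete
    }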

"What is the right/convenient way to read stream from Kafka (Confluent) topic? (I'm not considering offsets storing engine of Kafka)"

Asking about "the right way" might lead to opinion-based answers and is therefore off-topic on Stack Overflow. Anyway, using Spark Structured Streaming is already a mature and production-ready approach in my experience. However, it is always worth also looking into Kafka Connect.

  • Dear @mike, regarding your first 2 answers: when you say "clean up the checkpoint files", do you mean physically deleting all files under the `checkpointLocation/offsets/` directory? Is there any built-in functionality to call from code? – deeplay Oct 22 '20 at 07:59
  • Yes, you need to delete them physically in your file system. There is no built-in functionality that deletes the checkpoint files of a running query. – Michael Heil Oct 22 '20 at 08:45
  • Keep in mind that the behaviour you are trying to achieve (starting from the latest offset on every submission of the job) is rather for testing purposes, and you would probably never do this in production. If you do need to do this in production, you may consider a batch job instead of a streaming job :-) – Michael Heil Oct 22 '20 at 08:47
  • OK, then how do I handle offsets in prod? If the prod app crashes for some reason (the server is down, a network issue, etc.) and you submit your Spark app again, how will the app know which offset to start reading from in Kafka? I'm trying to find the right configuration to change my code. – deeplay Oct 22 '20 at 09:38
  • If you do not delete your checkpoint files on production, the job will read the content of the checkpoint files and continue consuming from where it left off. If you keep the checkpoint files, the `startingOffsets` setting in your code will be ignored. – Michael Heil Oct 22 '20 at 09:40 (a minimal sketch of this restart behaviour follows the comment thread)
  • And there is no need to add any additional Kafka options to my `readStream.format("kafka")`? Can you please quickly review my code in the question and confirm that this config is OK for prod, and that checkpointLocation will do the job ("continue consuming from where it left off")? – deeplay Oct 22 '20 at 09:45
  • Yes, your code looks fine from that perspective (I did not look into the foreachBatch part). – Michael Heil Oct 22 '20 at 09:48
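To make the restart behaviour from the comments concrete, here is a minimal, hedged sketch of the production-style setup: the `checkpointLocation` is kept stable across submits, so on restart Spark resumes from the offsets stored there, and `startingOffsets` only matters for the very first run. It assumes the `df` from the question; the val name `writeToHive` is illustrative, and the table and path are the ones from the question:

    import org.apache.spark.sql.DataFrame

    // Hedged sketch of the production pattern: keep the checkpoint directory,
    // so a restarted job continues from the offsets recorded there.
    // "startingOffsets" is only consulted on the very first run, before any checkpoint exists.
    val writeToHive = (batchDF: DataFrame, batchId: Long) => {
      // any filtering/idempotency logic would go here, as in the question's foreachBatch
      batchDF.write.insertInto("default.some_hive_table")
    }

    df.writeStream
      .foreachBatch(writeToHive)
      .option("checkpointLocation", "/user/some_user/tmp/checkpointLocation") // keep this stable across submits
      .start()
      .awaitTermination()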