Hello Stack Overflow community.
I'm running a Spark Structured Streaming app in a production environment, and we noticed that the Spark checkpoints contribute heavily to the number of under-replicated blocks in HDFS, which affects HDFS stability. I'm trying to find a proper way to clean up Spark checkpoints regularly rather than deleting them manually from HDFS. I referred to a couple of posts: Spark Structured Streaming Checkpoint Cleanup and Spark structured streaming checkpoint size huge.

What I came up with is to point the Spark checkpoint directory and the Structured Streaming checkpoint location to the same path and set the checkpoint cleaning configuration to true. This will create one checkpoint directory per Spark context. I suspect this might defeat the purpose of checkpointing, but I'm still trying to understand Spark's internals and would appreciate any guidance here. Below is a snippet of my code:
spark.sparkContext.setCheckpointDir(checkPointLocation)

// Reuse the SparkContext checkpoint dir as the Structured Streaming checkpoint location
val options = Map("checkpointLocation" -> spark.sparkContext.getCheckpointDir.get)

val q = df.writeStream
  .options(options)
  .trigger(trigger)
  .queryName(queryName)
  .start()
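To make the cleaning configuration I mentioned explicit, here is a minimal sketch of how the session could be built, assuming the relevant flag is spark.cleaner.referenceTracking.cleanCheckpoints (the app name is just a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("streaming-app") // placeholder app name
  // Ask the context cleaner to delete RDD checkpoint files once their references go out of scope
  .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
  .getOrCreate()

My understanding is that this cleaner applies to SparkContext (RDD) checkpoints, and I'm unsure whether it also covers the Structured Streaming offset/commit logs, which is part of what I'd like clarified.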