I have this code, and it's giving the error "basePath must be a directory". I just want to run a simple streaming Kafka sink.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}

    val checkPointDir = "/tmp/offsets/" // "hdfs://hdfscluster/user/yarn/tmp/"

    def main(args: Array[String]): Unit = {
      lazy val spark = SparkSession
        .builder
        .appName("KafkaProducer")
        .master("local[*]")
        .getOrCreate()

      // jsonDF is built elsewhere with spark.readStream.text(PATH_TO_FILE)
      val query = writeStream(jsonDF, "test")
      query.awaitTermination()
    }

    def writeStream(df: DataFrame, topic: String): StreamingQuery = {
      // log.warn("Writing to kafka")
      df
        // .selectExpr("CAST(value AS STRING)")
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", kafkaServers) // kafkaServers defined elsewhere
        .option("topic", topic)
        .option("checkpointLocation", checkPointDir)
        .outputMode(OutputMode.Update)
        .start()
    }

My user is the owner of the folder /tmp/offsets, yet I'm getting this exception.

java.lang.IllegalArgumentException: Option 'basePath' must be a directory

Sam
  • Possible duplicate of [Error: java.lang.IllegalArgumentException: Option 'basePath' must be a directory](https://stackoverflow.com/questions/48357753/error-java-lang-illegalargumentexception-option-basepath-must-be-a-directory) – pushpavanthar Jul 03 '18 at 09:59
  • Nope! I've already tried that. It's different from that. – Sam Jul 03 '18 at 11:24
  • Can you try providing the canonical path of the file, like this: `new File(path).getCanonicalFile`? – pushpavanthar Jul 03 '18 at 11:30
  • Which file? The one I make the DataFrame from? I'm using a local file, read with spark.readStream.text(PATH_TO_FILE), and getCanonicalPath returns the same path. – Sam Jul 03 '18 at 12:20
  • It's like this: /home/user/Documents/fn/Proto2/src/main/resource/events-identification-carrier-a.txt – Sam Jul 03 '18 at 12:28
  • Oh, got it! Fixed it! Yes, I was giving the file name as well. But now I have a question: why do we only give the path to the directory, and not specify the file name? What if there are multiple files in the same directory? – Sam Jul 03 '18 at 12:31
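
To make the fix from the last comment concrete, here is a minimal sketch (it reuses the `spark` session from the question; the path is the one mentioned in the comments and is only illustrative). Spark's streaming file source monitors a directory, so `spark.readStream.text` should be pointed at the directory rather than at a single file:

    // Point the streaming source at the directory, not at a file inside it.
    // Spark streams every file in the directory, including files that
    // arrive after the query starts.
    val inputDir = "/home/user/Documents/fn/Proto2/src/main/resource/"
    val jsonDF = spark.readStream.text(inputDir)

    // If the directory also holds files you don't want to stream,
    // a glob pattern narrows the selection:
    // val jsonDF = spark.readStream.text(inputDir + "events-*.txt")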

1 Answer

"checkpointLocation" should be given a canonical path of a directory.

This directory is used to store the actual intermediate RDDs. More than one RDD can be stored there, since there can be multiple checkpoints, and each RDD's data goes into a separate subdirectory. The RDDs themselves are partitioned, and each partition is stored in a separate file inside its RDD's directory. When storing files in HDFS, Spark also has to abide by the maximum block size property. Storing such structured data in a single file is not possible, hence a directory.
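
As a minimal sketch of that advice (the variable names are assumed for illustration), one can resolve the checkpoint location to the canonical path of an existing directory before starting the query, which also covers the `getCanonicalFile` suggestion from the comments:

    import java.io.File

    // Sketch only: make sure the checkpoint location exists as a directory,
    // then hand Spark its canonical path.
    val checkpointFile = new File("/tmp/offsets")
    checkpointFile.mkdirs() // create the directory (and parents) if missing

    // getCanonicalPath resolves symlinks and relative segments,
    // e.g. "/tmp/offsets"
    val checkPointDir = checkpointFile.getCanonicalPath

    // then, as in the question:
    // .option("checkpointLocation", checkPointDir)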

pushpavanthar