I have this code, and it's giving the error "basePath must be a directory". I just want to run a simple streaming Kafka sink.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}

    val checkPointDir = "/tmp/offsets/" // "hdfs://hdfscluster/user/yarn/tmp/"

    def main(args: Array[String]): Unit = {
      lazy val spark = SparkSession
        .builder
        .appName("KafkaProducer")
        .master("local[*]")
        .getOrCreate()

      // jsonDF is built elsewhere with spark.readStream.text(PATH_TO_FILE)
      val query = writeStream(jsonDF, "test")
      query.awaitTermination()
    }

    def writeStream(df: DataFrame, topic: String): StreamingQuery = {
      // log.warn("Writing to kafka")
      df
        // .selectExpr("CAST(value AS STRING)")
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", kafkaServers) // kafkaServers defined elsewhere
        .option("topic", topic)
        .option("checkpointLocation", checkPointDir)
        .outputMode(OutputMode.Update)
        .start()
    }

My user is the owner of the folder /tmp/offsets, yet I'm getting this exception.

java.lang.IllegalArgumentException: Option 'basePath' must be a directory

Sam
  • Possible duplicate of [Error: java.lang.IllegalArgumentException: Option 'basePath' must be a directory](https://stackoverflow.com/questions/48357753/error-java-lang-illegalargumentexception-option-basepath-must-be-a-directory) – pushpavanthar Jul 03 '18 at 09:59
  • Nope! I've already tried that. It's different from that. – Sam Jul 03 '18 at 11:24
  • Can you try providing the canonical path of the file, like this: `new File(path).getCanonicalFile`? – pushpavanthar Jul 03 '18 at 11:30
  • Which file? The one I make the DataFrame from? I'm using a local file, read with spark.readStream.text(PATH_TO_FILE), and getCanonicalPath returns the same path. – Sam Jul 03 '18 at 12:20
  • It's like this: /home/user/Documents/fn/Proto2/src/main/resource/events-identification-carrier-a.txt – Sam Jul 03 '18 at 12:28
  • Oh, got it! Fixed it! Yes, I was giving the file name as well. But now I have a question: why do we only give the path to the directory, and not specify the file name? What if there are multiple files in the same directory? – Sam Jul 03 '18 at 12:31
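
To make the fix from the last comment concrete, here is a minimal sketch (it reuses the `spark` session from the question; the path is the one mentioned in the comments and is only illustrative). Spark's streaming file source monitors a directory, so `spark.readStream.text` should be pointed at the directory rather than at a single file:

    // Point the streaming source at the directory, not at a file inside it.
    // Spark streams every file in the directory, including files that
    // arrive after the query starts.
    val inputDir = "/home/user/Documents/fn/Proto2/src/main/resource/"
    val jsonDF = spark.readStream.text(inputDir)

    // If the directory also holds files you don't want to stream,
    // a glob pattern narrows the selection:
    // val jsonDF = spark.readStream.text(inputDir + "events-*.txt")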

1 Answer

"checkpointLocation" should be given a canonical path of a directory.

This directory is used to store the actual intermediate RDDs. More than one RDD can be stored there, since there can be multiple checkpoints, and each RDD's data goes into a separate subdirectory. The RDDs themselves are partitioned, and each partition is stored in a separate file inside its RDD's directory. When storing files in HDFS, Spark also has to abide by the maximum block size property. Storing such structured data in a single file is not possible, hence a directory.
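
As a minimal sketch of that advice (the variable names are assumed for illustration), one can resolve the checkpoint location to the canonical path of an existing directory before starting the query, which also covers the `getCanonicalFile` suggestion from the comments:

    import java.io.File

    // Sketch only: make sure the checkpoint location exists as a directory,
    // then hand Spark its canonical path.
    val checkpointFile = new File("/tmp/offsets")
    checkpointFile.mkdirs() // create the directory (and parents) if missing

    // getCanonicalPath resolves symlinks and relative segments,
    // e.g. "/tmp/offsets"
    val checkPointDir = checkpointFile.getCanonicalPath

    // then, as in the question:
    // .option("checkpointLocation", checkPointDir)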

pushpavanthar