
I have a simple setup reading from Kafka and writing to the local console:

SparkSession is created with .master("local[*]") and I start the stream with:

var df = spark.readStream.format("kafka").options(...).load()
df = df.select("some_column")

df.writeStream.format("console")
  .outputMode("append")
  .start()
  .awaitTermination()

The same Kafka setup works perfectly fine with a batch/normal DataFrame, but for this streaming job I get the exception: Permission denied: user=user, access=WRITE, inode="/":hdfs:hdfs:drwxr-xr-x

Why does it want access to HDFS when I only want to print the data to the local console? And how can I solve this?

Seb
  • I assume you're running using spark-submit without any custom options? What Spark version? – OneCricketeer Apr 27 '22 at 13:46
    You may need to specify the location of the checkpoint directory. See [here](https://stackoverflow.com/questions/65574234/read-data-from-kafka-and-print-to-console-with-spark-structured-sreaming-in-pyth). – Hristo Iliev Apr 27 '22 at 15:10
  • @HristoIliev I tried, but every location will get interpreted as HDFS location, where I still don't have write access. – Seb Apr 30 '22 at 10:33
  • @OneCricketeer Right assumption. Spark version = "3.2.1" and Scala version = "2.12.10" – Seb Apr 30 '22 at 10:36
  • Try adding the `checkpointLocation` option; see https://stackoverflow.com/a/44889408/1843329 – snark Jul 28 '22 at 15:20
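For reference, a minimal sketch of what the comments above suggest: set an explicit checkpointLocation with a file:// scheme, so Spark resolves the path on the local filesystem rather than against fs.defaultFS (which points at HDFS on a cluster-configured machine, producing the write error on /). The broker address, topic name, and checkpoint path below are placeholders, not taken from the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "events")                       // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING)")                 // Kafka values arrive as binary

df.writeStream
  .format("console")
  .outputMode("append")
  // The explicit file:// URI keeps the streaming checkpoint on the local
  // filesystem instead of the cluster's default filesystem (HDFS).
  .option("checkpointLocation", "file:///tmp/spark-checkpoints/console-demo")
  .start()
  .awaitTermination()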

0 Answers