
My OS is Windows 11 and my Apache Spark version is spark-3.1.3-bin-hadoop3.2.

I am trying to use Spark Structured Streaming with pyspark. Below is my simple Structured Streaming code.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName(appName).getOrCreate()
spark.sparkContext.setCheckpointDir("/C:/tmp")

The same Spark code, without the spark.sparkContext.setCheckpointDir line, throws no errors on Ubuntu 22.04. However, the code above does not work on Windows 11. The exception is:

pyspark.sql.utils.IllegalArgumentException: Pathname /C:/tmp/67b1f386-1e71-4407-9713-fa749059191f from C:/tmp/67b1f386-1e71-4407-9713-fa749059191f is not a valid DFS filename.

I think this error means the checkpoint directory is being resolved against a Hadoop (DFS) file system, not the Windows 11 local file system. My operating system is Windows, so the checkpoint directory should be a Windows 11 local directory. How can I configure the Apache Spark checkpoint to use a Windows 11 local directory? I also tested the URLs file:///C:/temp and hdfs://C:/temp, but the errors are still thrown.
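For reference, these are roughly the variants I tried (the paths are just test locations):

spark.sparkContext.setCheckpointDir("file:///C:/temp")  # local file URI
spark.sparkContext.setCheckpointDir("hdfs://C:/temp")   # hdfs-style URI

Both still raise the same IllegalArgumentException.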

Update

I commented out the line below.

#spark.sparkContext.setCheckpointDir("/C:/tmp") 

Then the following exceptions are thrown:

WARN streaming.StreamingQueryManager: Temporary checkpoint location created which is deleted normally when the query didn't fail: C:\Users\joseph\AppData\Local\Temp\temporary-be4f3586-d56a-4830-986a-78124ab5ee74. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.

pyspark.sql.utils.IllegalArgumentException: Pathname /C:/Users/joseph/AppData/Local/Temp/temporary-be4f3586-d56a-4830-986a-78124ab5ee74 from hdfs://localhost:9000/C:/Users/joseph/AppData/Local/Temp/temporary-be4f3586-d56a-4830-986a-78124ab5ee74 is not a valid DFS filename.

I wonder why the hdfs URL contains the C:/ drive letter, and I want to know how to set spark.sql.streaming.forceDeleteTempCheckpointLocation to true.
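Based on the warning message, I assume the flag can be set when building the SparkSession, something like this (untested sketch):

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName(appName)
    # Flag named in the warning above; forces best-effort deletion of the
    # temporary checkpoint folder even if the query fails.
    .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true")
    .getOrCreate()
)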


1 Answer


step 1) Since you are running Spark on a Windows machine, make sure the winutils.exe file is added to the Hadoop bin folder; see step 6 of this reference for the same: https://phoenixnap.com/kb/install-spark-on-windows-10.
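For example, a minimal sketch, assuming winutils.exe lives under C:\hadoop\bin (adjust to your actual install path), is to point HADOOP_HOME at it before creating the SparkSession:

import os

# Assumed install location; winutils.exe must sit in %HADOOP_HOME%\bin
os.environ["HADOOP_HOME"] = "C:\\hadoop"
os.environ["PATH"] += os.pathsep + os.path.join(os.environ["HADOOP_HOME"], "bin")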

step 2) Then set the checkpoint directory like this: spark.sparkContext.setCheckpointDir("D:\Learn\Checkpoint"). Make sure the Spark user has permission to write to the mentioned checkpoint directory. A minimal sketch follows.
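Putting it together (D:\Learn\Checkpoint is just an example; any writable local folder works, and forward slashes avoid Python backslash-escape surprises):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("checkpoint-demo")  # example app name
    .getOrCreate()
)
spark.sparkContext.setCheckpointDir("D:/Learn/Checkpoint")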
