My OS is Windows 11 and my Apache Spark version is spark-3.1.3-bin-hadoop3.2.
I am trying to use Spark Structured Streaming with PySpark. Below is my simple Structured Streaming code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName(appName).getOrCreate()
spark.sparkContext.setCheckpointDir("/C:/tmp")
The same Spark code without the spark.sparkContext.setCheckpointDir
line throws no errors on Ubuntu 22.04. However, the code above does not run successfully on Windows 11. The exception is
pyspark.sql.utils.IllegalArgumentException: Pathname /C:/tmp/67b1f386-1e71-4407-9713-fa749059191f from C:/tmp/67b1f386-1e71-4407-9713-fa749059191f is not a valid DFS filename.
I think this error means the checkpoint directory is being generated on a Hadoop (HDFS) file system path, not on the Windows 11 local file system. My operating system is Windows, so the checkpoint directory should be a Windows 11 local directory. How can I configure the Apache Spark checkpoint location to be a Windows 11 local directory? I tried the URLs file:///C:/temp
and hdfs://C:/temp
for testing, but the errors are still thrown.
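For reference, here is a minimal sketch of the variants I tried ("checkpoint-test" is a placeholder app name, not my real one):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("checkpoint-test")  # placeholder app name
    .getOrCreate()
)

# Variants tried; each of them still raises
# pyspark.sql.utils.IllegalArgumentException on Windows 11:
spark.sparkContext.setCheckpointDir("/C:/tmp")
spark.sparkContext.setCheckpointDir("file:///C:/temp")
spark.sparkContext.setCheckpointDir("hdfs://C:/temp")
```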
Update
I commented out the line below:
#spark.sparkContext.setCheckpointDir("/C:/tmp")
Then the following warning and exception are thrown:
WARN streaming.StreamingQueryManager: Temporary checkpoint location created which is deleted normally when the query didn't fail: C:\Users\joseph\AppData\Local\Temp\temporary-be4f3586-d56a-4830-986a-78124ab5ee74. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
pyspark.sql.utils.IllegalArgumentException: Pathname /C:/Users/joseph/AppData/Local/Temp/temporary-be4f3586-d56a-4830-986a-78124ab5ee74 from hdfs://localhost:9000/C:/Users/joseph/AppData/Local/Temp/temporary-be4f3586-d56a-4830-986a-78124ab5ee74 is not a valid DFS filename.
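My guess is that the hdfs://localhost:9000 prefix in that message comes from the Hadoop fs.defaultFS setting that Spark resolves scheme-less paths against, though I have not confirmed this. A sketch of how to inspect it (note that _jsc is an internal PySpark attribute):

```python
# Inspect the Hadoop configuration the running SparkSession uses.
# If fs.defaultFS points at hdfs://localhost:9000, scheme-less paths
# like /C:/tmp would be resolved against that filesystem.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.defaultFS"))
```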
I wonder why the hdfs:// URL contains the C:/ drive letter, and I want to know how to set spark.sql.streaming.forceDeleteTempCheckpointLocation to true.
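For the second point, this is where I would expect to set that flag, assuming it is passed like any other Spark SQL configuration property (a sketch; "streaming-app" is a placeholder app name, and I have not confirmed that this resolves the checkpoint path problem):

```python
from pyspark.sql import SparkSession

# Set the flag when building the session...
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("streaming-app")  # placeholder app name
    .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true")
    .getOrCreate()
)

# ...or on an already-running session:
spark.conf.set("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true")
```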