I have a DataFrame df in PySpark, which results from calling:
df = spark.sql("select A, B from org_table")
df = df.stuffIdo
I want to overwrite org_table at the end of my script.
Since overwriting a table that is also being read as input is forbidden, I checkpointed my data:
sparkContext.setCheckpointDir("hdfs:/directoryXYZ/PrePro_temp")
checkpointed = df.checkpoint(eager=True)
The lineage should be broken now, and I can see my checkpointed data with checkpointed.show() (this works). What does not work is writing the table:
checkpointed.write.format('parquet')\
.option("checkpointLocation", "hdfs:/directoryXYZ/PrePro_temp")\
.mode('overwrite').saveAsTable('org_table')
This results in an error:
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://org_table_path/org_table/part-00081-4e9d12ea-be6a-4a01-8bcf-1e73658a54dd-c000.snappy.parquet
I have tried several things, such as refreshing org_table before writing, but I'm still puzzled. How can I solve this error?
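For reference, here are the steps above combined into one function, as a minimal sketch of what my script does. The function name overwrite_via_checkpoint and the transform parameter are hypothetical and only stand in for my actual processing (df.stuffIdo above). I am also assuming here that sparkContext is the context of the same session, i.e. spark.sparkContext.

```python
def overwrite_via_checkpoint(spark, table, checkpoint_dir, transform):
    """Read `table`, transform it, checkpoint the result, then overwrite `table`.

    `spark` is an existing SparkSession; `transform` is a hypothetical
    placeholder for the processing done between reading and writing.
    """
    # Read the input columns from the table that will be overwritten later.
    df = spark.sql(f"select A, B from {table}")

    # Placeholder for the actual processing (df.stuffIdo in the question).
    df = transform(df)

    # Eager checkpoint, intended to cut the lineage back to `table`.
    # Assumption: the checkpoint directory is set on the session's own context.
    spark.sparkContext.setCheckpointDir(checkpoint_dir)
    checkpointed = df.checkpoint(eager=True)

    # Overwrite the input table. In my case this is the step that fails
    # with java.io.FileNotFoundException.
    (
        checkpointed.write.format("parquet")
        .option("checkpointLocation", checkpoint_dir)
        .mode("overwrite")
        .saveAsTable(table)
    )
    return checkpointed
```

In my script this corresponds to calling overwrite_via_checkpoint(spark, "org_table", "hdfs:/directoryXYZ/PrePro_temp", transform) with an active session.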