
I have a DataFrame df in PySpark, which results from calling:

df = spark.sql("select A, B from org_table")
df = df.stuffIdo

I want to overwrite org_table at the end of my script. Since overwriting input tables is forbidden, I checkpointed my data:

spark.sparkContext.setCheckpointDir("hdfs:/directoryXYZ/PrePro_temp")
checkpointed = df.checkpoint(eager=True)

The lineage should be broken now, and I can also see my checkpointed data with checkpointed.show() (this works). What does not work is writing the table:

checkpointed.write.format('parquet')\
    .option("checkpointLocation", "hdfs:/directoryXYZ/PrePro_temp")\
    .mode('overwrite').saveAsTable('org_table')

This results in an error:

Caused by: java.io.FileNotFoundException: File does not exist: hdfs://org_table_path/org_table/part-00081-4e9d12ea-be6a-4a01-8bcf-1e73658a54dd-c000.snappy.parquet

I have tried several things, like refreshing org_table before writing, etc., but I'm puzzled. How can I solve this error?

Markus

1 Answer


I would be careful with such operations where the transformed input becomes the new output. The reason is that you can lose your data in case of any error. Let's imagine that your transformation logic was buggy and you generated invalid data, but you only noticed it a day later. Moreover, to fix the bug, you cannot use the data you've just transformed; you'd need the data from before the transformation. What would you do to bring the data back to a consistent state?

An alternative approach would be:

  • exposing a view
  • at each batch you write a new table, and at the end you simply replace the view with this new table (see the sketch after this list)
  • after some days you can also schedule a cleaning job that deletes the tables older than X days
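
A minimal sketch of that view-based flow, assuming a Hive-supported metastore; the view name org_view and the timestamped table name are illustrative:

from datetime import datetime

# Write each batch into a fresh, timestamped table instead of overwriting.
batch_table = "org_table_" + datetime.now().strftime("%Y%m%d_%H%M%S")
df.write.format('parquet').mode('overwrite').saveAsTable(batch_table)

# Atomically repoint the view that downstream readers query.
spark.sql("CREATE OR REPLACE VIEW org_view AS SELECT * FROM " + batch_table)

Consumers always read org_view, so they never see a half-written table, and every previous batch table remains available until the cleaning job removes it.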

If you want to stay with your solution, why not simply do this instead of dealing with checkpointing?

df.write.mode('overwrite').parquet("hdfs:/directoryXYZ/PrePro_temp")

spark.read.parquet("hdfs:/directoryXYZ/PrePro_temp")\
    .write.format('parquet').mode('overwrite').saveAsTable('org_table')

Of course, you will read the data twice, but it looks less hacky than the checkpoint-based version. Moreover, you could store your "intermediate" data in a different directory every time, and thanks to that you can address the issue I exposed at the beginning: even if you had a bug, you could still restore a valid version of the data by simply choosing a good directory and writing it back to org_table, as sketched below.
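
For example, a minimal sketch with a per-run directory (the date-suffixed path is just an assumption about how you might organize it):

from datetime import date

# Keep one intermediate copy per run so earlier versions stay recoverable.
temp_dir = "hdfs:/directoryXYZ/PrePro_temp/" + date.today().isoformat()
df.write.mode('overwrite').parquet(temp_dir)

# If a bug slips through, rerun this step against a known-good directory.
spark.read.parquet(temp_dir).write.format('parquet')\
    .mode('overwrite').saveAsTable('org_table')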

Bartosz Konieczny