
In my PySpark notebook, I:

  1. read from tables to create data frames
  2. aggregate the data frame
  3. write to a folder
  4. create a SQL table from the output folder

For

#1. I do `spark.read.format("delta").load(path)`
#2. I do `df = df.filter(...).groupBy(...).agg(...)`
#3. I do `df.write.format("delta").mode("append").save(output_folder)`
#4. I do `spark.sql(f"CREATE TABLE IF NOT EXISTS {sql_database}.{table_name} USING delta LOCATION '{path to output folder in #3}'")`
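
For reference, here is a minimal end-to-end sketch of what the notebook does (the paths, database/table names, and the filter/aggregation columns below are placeholders, not my real ones; it assumes Delta Lake is available in the session):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# placeholder paths and names
path = "/mnt/data/source_table"
output_folder = "/mnt/data/aggregated_output"
sql_database = "mydb"
table_name = "daily_totals"

# step 1: read the Delta table into a data frame
df = spark.read.format("delta").load(path)

# step 2: aggregate (placeholder columns)
df = (df.filter(F.col("status") == "active")
        .groupBy("category")
        .agg(F.sum("amount").alias("total_amount")))

# step 3: write the aggregated data to the output folder
df.write.format("delta").mode("append").save(output_folder)

# step 4: register a SQL table over that folder
spark.sql(f"CREATE TABLE IF NOT EXISTS {sql_database}.{table_name} "
          f"USING delta LOCATION '{output_folder}'")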

The issue I am having: I have debug print statements before and after steps 3 and 4. I have checked that there are Parquet files in my output folder, that the path I use when creating the SQL table is correct, and that there is no exception in the console.

But when I look for that newly created table in my SQL tool, I can't see it.

Can you please tell me whether I need to wait for the write to finish before I create the SQL table? If yes, what do I need to do to wait for the write to be done?


1 Answer


By default, `.write` is a synchronous operation, so step 4 will be executed only after step 3 is done. But really, it's easier to write and create the table in one step by using `.saveAsTable` together with the `path` option:

df = spark.read.format("delta").load(path)
df = df.filter(...).groupBy(...).agg(...)
df.write.format("delta").mode("append") \
  .option("path", output_folder) \
  .saveAsTable(f"{sql_database}.{table_name}")
Alex Ott