
In my PySpark notebook, I:

  1. read from tables to create data frames
  2. aggregate the data frame
  3. write to a folder
  4. create a SQL table from the output folder

For

#1. I do `spark.read.format("delta").load(path)`
#2. I do `df = df.filter(...).groupBy(...).agg(...)`
#3. I do `df.write.format("delta").mode("append").save(output_folder)`
#4. I do `spark.sql(f"CREATE TABLE IF NOT EXISTS {sql_database}.{table_name} USING delta LOCATION '{path to output folder in #3}'")`
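
For reference, here is a minimal end-to-end sketch of what the notebook does (the paths, database/table names, and the filter/aggregation columns below are placeholders, not my real ones; it assumes Delta Lake is available in the session):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# placeholder paths and names
path = "/mnt/data/source_table"
output_folder = "/mnt/data/aggregated_output"
sql_database = "mydb"
table_name = "daily_totals"

# step 1: read the Delta table into a data frame
df = spark.read.format("delta").load(path)

# step 2: aggregate (placeholder columns)
df = (df.filter(F.col("status") == "active")
        .groupBy("category")
        .agg(F.sum("amount").alias("total_amount")))

# step 3: write the aggregated data to the output folder
df.write.format("delta").mode("append").save(output_folder)

# step 4: register a SQL table over that folder
spark.sql(f"CREATE TABLE IF NOT EXISTS {sql_database}.{table_name} "
          f"USING delta LOCATION '{output_folder}'")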

The issue I am having: I have debug print statements before and after steps 3 and 4. I have checked that there are Parquet files in my output folder, that the path I use when creating the SQL table is correct, and that there is no exception in the console.

But when I look for that newly created table in my SQL tool, I can't see it.

Can you please tell me whether I need to wait for the write to finish before I create the SQL table? If yes, what do I need to do to wait for the write to be done?


1 Answer


By default, `.write` is a synchronous operation, so step 4 will be executed only after step 3 is done. But really, it's easier to write and create the table in one step by using `.saveAsTable` together with the `path` option:

df = spark.read.format("delta").load(path)
df = df.filter(...).groupBy(...).agg(...)
df.write.format("delta").mode("append") \
  .option("path", output_folder) \
  .saveAsTable(f"{sql_database}.{table_name}")
Alex Ott