1

I'm looping through some CSV files in a folder. I want to write these CSV files as delta tables only if they are all valid. Each CSV file in the folder has a different name and schema. I want to reject the entire folder and all the files it contains until the data are fixed. I'm running a lot of tests, but ultimately I have to actually write the files as delta tables with the following loop (simplified for this question):

for f in files:
    # read csv 
    df = spark.read.csv(f, header=True, schema=schema)
    # appending to the already existing delta table for this file
    df.write.format("delta").mode("append").save('path/' + f)

Is there a callback mechanism so the write method is executed only if none of the dataframes raises an error? Delta table schema enforcement is pretty rigid, which is great, but errors can pop up at any time despite all the tests I'm running before passing these files into this loop.

union is not an option because I want to handle this by date, and each file has a different schema and name.
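
To make the intent concrete, a rough sketch of the all-or-nothing behaviour would look something like this (here `schemas` is a hypothetical per-file schema mapping, and the FAILFAST read option is just one way to surface bad rows early):

# Phase 1: eagerly read and validate every file; collect failures instead of writing
dataframes = {}
errors = []
for f in files:
    try:
        # FAILFAST makes malformed rows raise instead of being silently nulled
        # schemas is a hypothetical dict mapping each file to its expected schema
        df = spark.read.csv(f, header=True, schema=schemas[f], mode="FAILFAST")
        df.count()  # force a full scan so read errors surface now, before any write
        dataframes[f] = df
    except Exception as e:
        errors.append((f, str(e)))

# Phase 2: write only if the whole folder passed validation
if errors:
    raise ValueError(f"Rejecting folder: {len(errors)} invalid file(s): {errors}")
for f, df in dataframes.items():
    df.write.format("delta").mode("append").save('path/' + f)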

Simon Breton
  • Have these `delta` tables already been created, or are you creating them anew? – Dipanjan Mallick Mar 18 '22 at 11:10
  • Yes. Before deploying my notebook and running it daily, these delta tables will already be created with the schemas I'm using here to read each CSV. – Simon Breton Mar 18 '22 at 13:05
  • Okay. One solution I can think of is to iterate over `spark.catalog.listTables()` and check whether a table name matches your `csv` filename. Then, in a second step, you could run the `COPY INTO` command, or I presume your write would also work. Please note that if no database is specified, the current database is used; this includes all temporary views. – Dipanjan Mallick Mar 18 '22 at 13:49
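
For reference, a minimal sketch of the `listTables()` / `COPY INTO` approach suggested in the last comment (the table-naming convention and options are assumptions, and `COPY INTO` requires Databricks or another Delta-enabled SQL engine):

import os

# Tables in the current database (includes temporary views, as noted above)
existing = {t.name for t in spark.catalog.listTables()}

for f in files:
    # Hypothetical convention: the target table is named after the CSV file
    table_name = os.path.splitext(os.path.basename(f))[0]
    if table_name in existing:
        # COPY INTO skips files it has already loaded, so re-runs are idempotent
        spark.sql(f"""
            COPY INTO {table_name}
            FROM '{f}'
            FILEFORMAT = CSV
            FORMAT_OPTIONS ('header' = 'true')
        """)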

1 Answer

0

You can use df.union() or df.unionByName() to combine all of your files into a single dataframe. Then that one dataframe is either written fully or fails as a whole.

# Create an empty dataframe with the target schema, to union each file into
emptyRDD = spark.sparkContext.emptyRDD()
df = spark.createDataFrame(emptyRDD, schema)

for f in files:
    # read csv 
    dfNext = spark.read.csv(f, header=True, schema=schema)
    df = df.unionByName(dfNext)

df.write.format("delta").save(path)
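
If the per-file schemas only differ by a few columns, unionByName can fill the missing ones with nulls instead of failing (this assumes Spark 3.1 or later); the union line inside the loop becomes:

    # Tolerate column differences between files by filling missing columns with null
    df = df.unionByName(dfNext, allowMissingColumns=True)
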
restlessmodem