
I want to save a Spark DataFrame in Delta format to S3, but for some reason the data is not saved. I debugged all the processing steps; there was data, and right before saving I ran count on the DataFrame, which returned 24 rows. But as soon as save is called, no data appears in the resulting folder. What could be the reason for it?

This is how I save the data:

import org.apache.spark.sql.ColumnName

df
  .select(schema)
  // spread rows across partitions by the partition keys
  .repartition(partitionKeys.map(new ColumnName(_)): _*)
  // order rows inside each partition by the sort keys
  .sortWithinPartitions(sortByKeys.map(new ColumnName(_)): _*)
  .write
  .format("delta")
  .partitionBy(partitionKeys: _*)
  .mode(saveMode)
  .save("s3a://etl-qa/data_feed")
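One way I could check whether the write actually produced a Delta table (rather than listing the S3 folder directly) would be to read the same path back; a minimal sketch, assuming the SparkSession is available as spark:

val written = spark.read.format("delta").load("s3a://etl-qa/data_feed")
println(written.count())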
Cassie
Can you replace `format` and `path` with their values so it's clear what you do? How do you check that _"no data appears in the resulting folder"_? – Jacek Laskowski Dec 15 '20 at 20:32
  • 1
Also, what is `saveMode`? Have you validated whether writing works without repartition, sortWithinPartitions, and partitionBy? – Michael Heil Dec 16 '20 at 09:57

1 Answer


There is a quick start guide from Databricks that explains how to read from and write to a Delta Lake.

If the DataFrame you are trying to save is called `df`, you need to execute:

df.write.format("delta").save(s3path)
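
For context, here is a minimal end-to-end sketch, assuming a running SparkSession, the delta-core package on the classpath, and a hypothetical bucket path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-write-example")
  .getOrCreate()

// hypothetical S3 path for illustration
val s3path = "s3a://my-bucket/my-table"

// write a small DataFrame in Delta format, overwriting any existing data
val df = spark.range(0, 24).toDF("id")
df.write.format("delta").mode("overwrite").save(s3path)

// read it back to confirm the rows landed
val readBack = spark.read.format("delta").load(s3path)
println(readBack.count()) // expected: 24

As the comments note, it is also worth checking what `saveMode` resolves to in your code; with `SaveMode.Ignore`, for instance, Spark silently skips the write when data already exists at the target path.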
Michael Heil