I need to write a very large DataFrame to a path on S3 every two hours:
df.write
  .partitionBy("year", "month", "day")
  .mode("overwrite")
  .parquet("s3://bucket/path/to/folder")
This job usually takes about 30 minutes. Consumers reading the data have been hitting file-not-found errors during that 30-minute window, presumably because overwrite mode deletes the existing files before the new write is complete.
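For context, the consumers are separate Spark batch jobs that read the same prefix directly while the write may be in flight. Below is a minimal sketch of that reader side; the code is my own reconstruction assuming PySpark, and only the S3 path comes from the job above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("path-consumer").getOrCreate()

# Straight read of the prefix that the writer overwrites every two hours.
# If this runs while the overwrite job is in progress, it can fail with
# FileNotFoundException: Spark lists the files up front, and some of them
# are deleted or replaced before (or while) they are scanned.
df = spark.read.parquet("s3://bucket/path/to/folder")
print(df.count())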
What are the most common techniques to minimize this downtime for the consumers?