
I need to write a very large DataFrame to a path on S3 every two hours.

df.write
  .partitionBy("year", "month", "day")
  .mode("overwrite")
  .parquet("s3://bucket/path/to/folder")

This job usually takes about 30 minutes.

Consumers reading the data have been experiencing file-not-found errors during that window.

What are the most common techniques to minimize downtime for the users?

Michel Hua
  • Assuming you want users to be able to read the data only once the entire write has finished, I suggest taking a look at table formats such as Apache Iceberg or Delta Lake. These formats make the write visible only once it is fully committed. – Guy Oct 29 '22 at 00:22
  • Can't change formats, my question is about parquet. – Michel Hua Oct 29 '22 at 07:10
  • The S3 replace is atomic according to this answer (https://stackoverflow.com/questions/30246784/aws-s3-replace-file-atomically). Not sure how Spark writes it, though. – ns15 Oct 29 '22 at 09:35
  • Maybe you want to write it somewhere else and then call `aws s3 mv` once it's done writing (see the staging-prefix sketch below). – 0x26res Oct 29 '22 at 09:55
  • If you want to continue using Parquet, you can also try the following: 1. Define the table in the Hive metastore (or some other catalog, like AWS Glue). 2. Add the partition to the table only once the above write has finished, using ALTER TABLE ... ADD PARTITION (https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-alter-table.html). Clients will then need to access the table by name rather than by the underlying path, and this way they will only see partitions that were explicitly added to the table (see the catalog sketch below). – Guy Oct 29 '22 at 15:07
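
Here is a minimal PySpark sketch of the staging-prefix idea from 0x26res's comment, reusing `df` from the question and assuming the AWS CLI is available on the driver; the `_staging` suffix is a hypothetical name for illustration, not part of the original job:

import subprocess

STAGING = "s3://bucket/path/to/folder_staging"  # hypothetical prefix, never read by consumers
FINAL = "s3://bucket/path/to/folder"            # the path consumers actually read

# 1. Spend the ~30-minute write on the staging prefix instead of the live path.
(df.write
   .partitionBy("year", "month", "day")
   .mode("overwrite")
   .parquet(STAGING))

# 2. Only after the job has fully succeeded, move the files under the live path.
#    S3 has no atomic directory rename, so this is still a per-object copy+delete,
#    but the window readers can observe shrinks from the whole Spark job to the move.
subprocess.run(["aws", "s3", "mv", STAGING, FINAL, "--recursive"], check=True)

Note that the move does not clean up part files left over from the previous run (Spark's part-file names differ between runs); running `aws s3 sync` with `--delete` from the staging prefix to the live path, then deleting the staging prefix, is one way to handle that.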
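
And here is a minimal PySpark sketch of the catalog idea from Guy's comment, reusing `df` and `spark` from the question and assuming a hypothetical external, partitioned table `analytics.events` has already been defined in the Hive metastore (or AWS Glue) over the same layout; the database, table name, and example partition values are made up for illustration:

partition_path = "s3://bucket/path/to/folder/year=2022/month=10/day=29"

# 1. Write one partition's data straight to its directory, bypassing the table.
(df.filter("year = 2022 AND month = 10 AND day = 29")
   .drop("year", "month", "day")   # partition values come from the partition spec, not the files
   .write
   .mode("overwrite")
   .parquet(partition_path))

# 2. Register the partition only after the write has finished; readers querying
#    the table by name cannot see the new directory until this metastore update.
spark.sql(f"""
    ALTER TABLE analytics.events
    ADD IF NOT EXISTS PARTITION (year=2022, month=10, day=29)
    LOCATION '{partition_path}'
""")

For re-publishing a partition that already exists, the same idea works by writing to a fresh directory and switching the partition over with ALTER TABLE ... PARTITION (...) SET LOCATION, so consumers flip to the new files in a single metastore update.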
