
I have a Spark script that pulls data from a database and writes it to S3 in Parquet format, partitioned by date. Because of the size of the table, I'd like to run the script daily and have it rewrite only the most recent few days of data (the redundancy is there because data may still change for a couple of days).

I'm wondering how I can write the data to S3 so that only the partitions for the days I'm working with are overwritten. SaveMode.Overwrite unfortunately wipes the entire output path before writing, and the other save modes don't seem to be what I'm looking for.

Snippet of my current write:

    table
      .filter(row => row.ts.after(twoDaysAgo)) // update most recent 2 days
      .withColumn("date", to_date(col("ts"))) // add a column with just date
      .write
      .mode(SaveMode.Overwrite)
      .partitionBy("date") // use the new date column to partition the parquet output
      .parquet("s3a://some-bucket/stuff") // pick a parent directory to hold the parquets

Any advice would be much appreciated, thanks!


1 Answer


The answer I was looking for was dynamic partition overwrite, detailed in this article. Short answer: adding this line fixed my problem:

    sparkConf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")
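
For context, here's a minimal sketch of the full write with that setting in place (assuming the same `table` and `twoDaysAgo` as in the question, plus a SparkSession named `spark`). With dynamic partition overwrite, `SaveMode.Overwrite` replaces only the `date` partitions present in the DataFrame being written and leaves older partitions under the same path untouched:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.{col, to_date}

    // Overwrite only the partitions produced by this write, not the whole output path
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    table
      .filter(row => row.ts.after(twoDaysAgo))  // most recent 2 days only
      .withColumn("date", to_date(col("ts")))   // partition column derived from the timestamp
      .write
      .mode(SaveMode.Overwrite)                 // with dynamic mode, only the written dates are replaced
      .partitionBy("date")
      .parquet("s3a://some-bucket/stuff")

Setting it via `spark.conf.set` works because this is a runtime SQL config; setting it on the SparkConf before building the session, as in the one-liner above, has the same effect.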
  • That feature isn't in the S3A committers, and the classic committer is neither safe nor performant. Use the S3A partitioned committer: https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/committers.html#The_.E2.80.9CPartitioned.E2.80.9D_Staging_Committer – stevel Sep 19 '22 at 11:35
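
For anyone following up on that comment, a rough sketch of the kind of configuration the linked Hadoop docs describe for the partitioned staging committer is below. The exact property names and the required spark-hadoop-cloud bindings should be checked against the docs for your Spark/Hadoop versions; this assumes the same `sparkConf` as in the answer, set before the session is built:

    // S3A partitioned staging committer, resolving conflicts by replacing
    // the partitions being written (per the hadoop-aws committer docs above)
    sparkConf.set("spark.hadoop.fs.s3a.committer.name", "partitioned")
    sparkConf.set("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
    // Bindings from the spark-hadoop-cloud module so Parquet commits go through the S3A committer
    sparkConf.set("spark.sql.sources.commitProtocolClass",
      "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    sparkConf.set("spark.sql.parquet.output.committer.class",
      "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")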