I have a Spark script that pulls data from a database and writes it to S3 in Parquet format, partitioned by date. Because the table is large, I'd like to run the script daily and have it rewrite only the most recent few days of data (the overlap is intentional, because data may change for a couple of days).
How can I write the data to S3 so that only the partitions for the days I'm working with are overwritten? SaveMode.Overwrite unfortunately wipes everything under the output path before writing, and the other save modes don't seem to do what I'm looking for.
Snippet of my current write:
table
  .filter(row => row.ts.after(twoDaysAgo)) // update most recent 2 days
  .withColumn("date", to_date(col("ts"))) // add a column with just date
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("date") // use the new date column to partition the parquet output
  .parquet("s3a://some-bucket/stuff") // pick a parent directory to hold the parquets
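One direction I have come across but not verified against S3 is Spark's dynamic partition overwrite mode. The sketch below assumes Spark 2.3 or later, where the `spark.sql.sources.partitionOverwriteMode` setting exists; the object name `OverwriteRecentPartitions`, the function `writeRecent`, and its parameters are hypothetical names used only for illustration, and `recent` is assumed to already hold just the rows for the days being refreshed.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, to_date}

object OverwriteRecentPartitions {

  // Hypothetical helper: `recent` is assumed to contain only the rows for the
  // days being refreshed and to have a timestamp column named "ts".
  def writeRecent(spark: SparkSession, recent: DataFrame, outputPath: String): Unit = {
    // Assumption: Spark 2.3+. In "dynamic" mode, SaveMode.Overwrite is meant to
    // replace only the partitions present in the data being written, instead of
    // clearing the whole output path first (the default "static" behavior).
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    recent
      .withColumn("date", to_date(col("ts"))) // derive the partition column
      .write
      .mode(SaveMode.Overwrite)
      .partitionBy("date")
      .parquet(outputPath) // e.g. "s3a://some-bucket/stuff"
  }
}
```

If this works the way I understand it, a daily run would replace only the `date=...` directories for the last couple of days and leave older partitions alone, but I don't know how it interacts with the S3A output committers, so corrections are welcome.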
Any advice would be much appreciated, thanks!