I have a Delta table with a destination in S3, created as follows:
(
df
.write
.mode('append')
.option('path', 's3://<s3_path>')
.saveAsTable(<table_name>)
)
The data was originally written unpartitioned, and I then sought to add a partition column as follows:
(
spark
.table(<table_name>)
.write
.mode('overwrite')
.option('overwriteSchema', 'true')
.option('path', 's3://<s3_path>')
.partitionBy(<partition_column>)
.saveAsTable(<table_name>)
)
The data now exists at s3_path in both its unpartitioned form (all the *.snappy.parquet files sitting directly at the prefix) and its partitioned form (under partition_column=X,Y,Z prefixes). How can I go about cleaning up the unpartitioned "old" data without affecting the table? Manually deleting those .parquet files doesn't feel correct, and a VACUUM command isn't working.
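
To be clear about what I mean by "without affecting the table": I only want to remove objects that the current snapshot no longer references. As a rough sketch (same placeholders as above), something like this should list the files the table still points to:

# List the data files referenced by the current Delta snapshot; anything under
# s3://<s3_path> that is NOT in this list is the "old" unpartitioned data I want gone.
current_files = spark.table(<table_name>).inputFiles()
for f in sorted(current_files)[:10]:
    print(f)  # after the overwrite these all sit under partition_column=... prefixes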
I disabled the VACUUM retention duration check as shown below and ran a dry-run VACUUM retaining only the latest 3 hours (it has been about 2 hours since the re-partitioning overwrite, according to DESCRIBE HISTORY <table_name>), but it returned no files. I've also confirmed through the table versions that the row count of all the append operations equals the number of rows written by the partitioned overwrite (a rough sketch of that check follows the SQL below).
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM <table_name> RETAIN 3 HOURS DRY RUN;
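
For reference, the version comparison mentioned above was roughly along these lines (a minimal sketch; <last_append_version> and <overwrite_version> are placeholders for the version numbers shown by DESCRIBE HISTORY):

# Compare row counts between the last pre-overwrite version and the partitioned
# overwrite version via Delta time travel (version numbers are placeholders).
old = spark.read.format('delta').option('versionAsOf', <last_append_version>).load('s3://<s3_path>')
new = spark.read.format('delta').option('versionAsOf', <overwrite_version>).load('s3://<s3_path>')
print(old.count(), new.count())  # these counts matched for me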