I have a Delta table with a destination in S3, created as follows:
(
df
.write
.mode('append')
.option('path', 's3://<s3_path>')
.saveAsTable(<table_name>)
)
The data was originally written unpartitioned, and I then sought to add a partition column as follows:
(
spark
.table(<table_name>)
.write
.mode('overwrite')
.option('overwriteSchema', 'true')
.option('path', 's3://<s3_path>')
.partitionBy(<partition_column>)
.saveAsTable(<table_name>)
)
The data now exists at s3_path in both its unpartitioned form (all the *.snappy.parquet files sitting directly at the prefix) and its partitioned form (under partition_column=X,Y,Z prefixes). How can I go about cleaning up the unpartitioned "old" data without affecting the table? Manually deleting those .parquet files doesn't feel correct, and a VACUUM command isn't working.
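
To be clear about what I mean by "without affecting the table": I only want to remove objects that the current snapshot no longer references. As a rough sketch (same placeholders as above), something like this should list the files the table still points to:

# List the data files referenced by the current Delta snapshot; anything under
# s3://<s3_path> that is NOT in this list is the "old" unpartitioned data I want gone.
current_files = spark.table(<table_name>).inputFiles()
for f in sorted(current_files)[:10]:
    print(f)  # after the overwrite these all sit under partition_column=... prefixes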
I disabled the VACUUM retention duration check as shown below and ran a dry-run VACUUM retaining only the latest 3 hours (it has been about 2 hours since the re-partitioning overwrite, according to DESCRIBE HISTORY <table_name>), but it returned no files. I've also confirmed through the table versions that the row count of all the append operations equals the number of rows written by the partitioned overwrite (a rough sketch of that check follows the SQL below).
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM <table_name> RETAIN 3 HOURS DRY RUN;
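
For reference, the version comparison mentioned above was roughly along these lines (a minimal sketch; <last_append_version> and <overwrite_version> are placeholders for the version numbers shown by DESCRIBE HISTORY):

# Compare row counts between the last pre-overwrite version and the partitioned
# overwrite version via Delta time travel (version numbers are placeholders).
old = spark.read.format('delta').option('versionAsOf', <last_append_version>).load('s3://<s3_path>')
new = spark.read.format('delta').option('versionAsOf', <overwrite_version>).load('s3://<s3_path>')
print(old.count(), new.count())  # these counts matched for me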