
I'm trying to vacuum my Delta tables in Databricks, but somehow it is not working and I don't understand why. As a result, our storage usage keeps increasing.

I have set the following table properties:

%sql
ALTER TABLE <table_name> SET TBLPROPERTIES 
("delta.deletedFileRetentionDuration" = "interval 2 hours");

%sql
ALTER TABLE <table_name> SET TBLPROPERTIES 
("delta.logRetentionDuration" = "interval 2 hours");

Then I run the following vacuum command in a Databricks notebook:

%sql
VACUUM db_name.table_name RETAIN 2 HOURS

or like this:

%sql
VACUUM db_name.table_name

The files that show up in DBFS as candidates for removal are still there after running these commands.
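
From what I've read, retaining less than the default 7 days normally requires disabling the retention duration check first, and a DRY RUN should list the files VACUUM would actually delete. This is only a sketch based on the Delta/Databricks docs, not something I have confirmed fixes it:

%sql
-- the check that normally blocks retention intervals below 7 days
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

%sql
-- list the files that would be deleted, without removing them
VACUUM db_name.table_name RETAIN 2 HOURS DRY RUN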

Example of the data in the delta_log json:

{"remove":{"path":"year=2021/month=05/day=06/part-00001-52dd3cf7-9afc-46b0-9a03-7be3d1ee533e.c000.snappy.parquet","deletionTimestamp":1622798688231,"dataChange":true}

I also added and deleted some data for testing purposes, because I read that the table needs to be modified before a vacuum can be executed successfully.

What am I missing here?

MCL94

2 Answers


Try checkpointRetentionDuration as well:

ALTER TABLE table_name
SET TBLPROPERTIES ('delta.checkpointRetentionDuration' = '7 days')

S-5618

After some research I found out that I was trying to remove log files with VACUUM instead of old data files. However, I did find out why our storage usage in Databricks kept increasing: Databricks creates a locked container in the Azure storage account and saves cluster metrics/logs there. This accounted for 2 TB of the 3 TB we were using in that particular storage account.

MCL94