2

I noticed that I have only 2 checkpoint files in a Delta Lake folder. Every 10 commits, a new checkpoint is created and the oldest one is removed.

For instance, this morning I had 2 checkpoints: 340 and 350. I was able to time travel from 340 to 359.

Now, after a "write" action, I have 2 checkpoints: 350 and 360, and I'm only able to time travel from 350 to 360. What could be removing the old checkpoints? How can I prevent that?

I'm using Azure Databricks 7.3 LTS ML.

Alex Ott
Nastasia
  • I don't even like Azure, but I'm voting this up because I'm so pissed delta lake defaults to finite retention. Optimizing, I could understand, but I can't believe a rollback platform cuts off access after 30 days by default. BEWARE! – combinatorist Aug 29 '23 at 23:01

2 Answers

1

The ability to perform time travel isn't directly related to the checkpoint. The checkpoint is just an optimization that allows the metadata to be accessed quickly as a Parquet file, without the need to scan the individual transaction log files. This blog post describes the transaction log in more detail.

The commit history is retained by default for 30 days, and can be customized as described in the documentation. Please also note that VACUUM may remove deleted files that are still referenced in the commit log, because data files are retained for only 7 days by default, so it's better to check the corresponding settings.
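
For reference, here is a minimal sketch of how both retention settings could be raised on a table; the /tmp/dtest path and the 365-day values are only placeholders for illustration:

# delta.logRetentionDuration controls how long commit history (and thus time travel) is kept;
# delta.deletedFileRetentionDuration controls how long VACUUM keeps removed data files around.
spark.sql("""
    ALTER TABLE delta.`/tmp/dtest`
        SET TBLPROPERTIES (
            delta.logRetentionDuration = 'interval 365 days',
            delta.deletedFileRetentionDuration = 'interval 365 days'
        )
""")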

If you perform the following test, you can see that you have history for more than 10 versions:

df = spark.range(10)
for i in range(20):
  df.write.mode("append").format("delta").save("/tmp/dtest")
  # uncomment if you want to see the content of the log after each operation
  #print(dbutils.fs.ls("/tmp/dtest/_delta_log/"))

Then check the files in the log; you should see both checkpoint files and JSON files for the individual transactions:

%fs ls /tmp/dtest/_delta_log/

Also check the history; you should have at least 20 versions:

%sql

describe history delta.`/tmp/dtest/`

And you should be able to go back to an early version:

%sql

SELECT * FROM delta.`/tmp/dtest/` VERSION AS OF 1
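
If it's more convenient, the same read can be done from Python with the DataFrame reader (a sketch against the same /tmp/dtest test path as above):

# read version 1 of the test table instead of the latest snapshot
old_df = spark.read.format("delta").option("versionAsOf", 1).load("/tmp/dtest")
old_df.show()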
Alex Ott
  • Thank you, Alex, for your answer! I don't run any vacuum command and I have a config for 365 days. Yet, my history is regularly being removed; it happens every 10 commits. It is linked to this bug: https://stackoverflow.com/questions/66561637/i-receive-the-error-cannot-time-travel-delta-table-to-version-x-whereas-i-can. – Nastasia Mar 18 '21 at 14:25
  • You don't have any expiration policies for that location? Otherwise I would recommend opening a ticket with Azure – Alex Ott Mar 18 '21 at 14:46
  • I don't. Thank you for your answer and your help. I'm going to open a ticket in Azure. – Nastasia Mar 18 '21 at 15:20
  • I would clarify that the ability to time travel or restore IS directly connected to checkpoint EXISTENCE and thus your checkpoint retention configuration: https://github.com/delta-io/delta/issues/616 – combinatorist Aug 29 '23 at 22:27
1

If you want to keep your checkpoints for X days, you can set delta.checkpointRetentionDuration to X days this way:

spark.sql("""
    ALTER TABLE delta.`path`
        SET TBLPROPERTIES (
            delta.checkpointRetentionDuration = 'X days'
        )
""")
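
To verify that the property was applied, you could list the table properties afterwards (a sketch; `path` is again a placeholder for your table location):

# confirm that delta.checkpointRetentionDuration now appears among the table properties
spark.sql("SHOW TBLPROPERTIES delta.`path`").show(truncate=False)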
Nastasia