Databricks open sourced the Delta Lake project in April 2019 (the open-source Delta Lake project still lacks some functionality available on the Databricks platform, such as data skipping). Details: Delta Lake, Docs, GitHub Repo
Delta is not a file format - it is a storage layer on top of Parquet data files plus metadata (transaction log) files in JSON format.
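As a rough illustration (the file names below are made up), a Delta table directory is just Parquet data files plus a _delta_log folder holding the JSON transaction log:

/path/to/my/table/
  _delta_log/
    00000000000000000000.json      <- transaction log / table metadata (JSON)
    00000000000000000001.json
  part-00000-xxxx.snappy.parquet   <- data files (Parquet)
  part-00001-xxxx.snappy.parquet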
It doesn't delete files automatically. A VACUUM operation must be run to delete older files that are no longer referenced (not active).
So without running VACUUM, you can time travel infinitely, since all data files remain available. On the other hand, if you run VACUUM with a 30-day retention, you can only access the last 30 days of data.
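As a minimal sketch (assuming a SparkSession named spark is in scope and the table path is illustrative), VACUUM with a 30-day (720-hour) retention can be run through the DeltaTable API:

Scala:
import io.delta.tables.DeltaTable

// Deletes files that are no longer referenced by the table and are
// older than the retention threshold (720 hours = 30 days)
val deltaTable = DeltaTable.forPath(spark, "/path/to/my/table")
deltaTable.vacuum(720)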
Yes, it solves querying across dataset versions. Each version can be identified by a timestamp. Sample queries to access data for a specific version:
Scala:
val df = spark.read
  .format("delta")
  .option("timestampAsOf", "2020-10-01")
  .load("/path/to/my/table")
Python:
df = spark.read \
  .format("delta") \
  .option("timestampAsOf", "2020-10-01") \
  .load("/path/to/my/table")
SQL:
SELECT count(*) FROM my_table TIMESTAMP AS OF "2010-10-01"
SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
SELECT count(*) FROM my_table TIMESTAMP AS OF "2010-10-01 01:30:00.000"
(Note: I am using open-source Delta Lake in production for multiple use cases.)