0

I am trying to query the Delta Lake table history as described at the following link

https://learn.microsoft.com/en-us/azure/databricks/delta/history

When I describe the delta table as follows

describe history '/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1'

I get the below table output

version timestamp userId userName operation operationParameters job notebook clusterId readVersion isolationLevel isBlindAppend operationMetrics userMetadata engineInfo
"2 18/03/2023 12:25:54.0000000 615257000000000 carlton@olvin.com MERGE {""predicate"":""(s.primary_key_hash = t.primary_key_hash)"",""matchedPredicates"":""[{""predicate"":""(NOT (s.change_key_hash = t.change_key_hash))"",""actionType"":""update""}]"",""notMatchedPredicates"":""[{""actionType"":""insert""}]"",""notMatchedBySourcePredicates"":""[{""actionType"":""delete""}]""} (null) {""notebookId"":""3807690121522291""} 0318-105603-oyrrx3xc 1 WriteSerializable False {""numTargetRowsCopied"":""0"",""numTargetRowsDeleted"":""1"",""numTargetFilesAdded"":""1"",""numTargetBytesAdded"":""9070"",""numTargetBytesRemoved"":""9176"",""numTargetDeletionVectorsAdded"":""0"",""numTargetRowsMatchedUpdated"":""27"",""executionTimeMs"":""13999"",""numTargetRowsInserted"":""0"",""numTargetRowsMatchedDeleted"":""0"",""scanTimeMs"":""4276"",""numTargetRowsUpdated"":""27"",""numOutputRows"":""27"",""numTargetDeletionVectorsRemoved"":""0"",""numTargetRowsNotMatchedBySourceUpdated"":""0"",""numTargetChangeFilesAdded"":""0"",""numSourceRows"":""27"",""numTargetFilesRemoved"":""1"",""numTargetRowsNotMatchedBySourceDeleted"":""1"",""rewriteTimeMs"":""9012""} (null) Databricks-Runtime/12.2.x-scala2.12"
"1 18/03/2023 12:14:43.0000000 615257000000000 carlton@olvin.com MERGE {""predicate"":""(s.primary_key_hash = t.primary_key_hash)"",""matchedPredicates"":""[{""predicate"":""(NOT (s.change_key_hash = t.change_key_hash))"",""actionType"":""update""}]"",""notMatchedPredicates"":""[{""actionType"":""insert""}]"",""notMatchedBySourcePredicates"":""[{""actionType"":""delete""}]""} (null) {""notebookId"":""3807690121522291""} 0318-105603-oyrrx3xc 0 WriteSerializable False {""numTargetRowsCopied"":""0"",""numTargetRowsDeleted"":""0"",""numTargetFilesAdded"":""1"",""numTargetBytesAdded"":""9176"",""numTargetBytesRemoved"":""0"",""numTargetDeletionVectorsAdded"":""0"",""numTargetRowsMatchedUpdated"":""0"",""executionTimeMs"":""6222"",""numTargetRowsInserted"":""28"",""numTargetRowsMatchedDeleted"":""0"",""scanTimeMs"":""2280"",""numTargetRowsUpdated"":""0"",""numOutputRows"":""28"",""numTargetDeletionVectorsRemoved"":""0"",""numTargetRowsNotMatchedBySourceUpdated"":""0"",""numTargetChangeFilesAdded"":""0"",""numSourceRows"":""28"",""numTargetFilesRemoved"":""0"",""numTargetRowsNotMatchedBySourceDeleted"":""0"",""rewriteTimeMs"":""3593""} (null) Databricks-Runtime/12.2.x-scala2.12"
"0 18/03/2023 12:14:23.0000000 615257000000000 carlton@olvin.com CREATE OR REPLACE TABLE {""isManaged"":""false"",""description"":null,""partitionBy"":""[]"",""properties"":""{}""} (null) {""notebookId"":""3807690121522291""} 0318-105603-oyrrx3xc (null) WriteSerializable True {} (null) Databricks-Runtime/12.2.x-scala2.12"

I have assigned the path with the following variable

saveloc = '/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1'

As you can see from the History Output above there is field called version and operationParameters

Its easy to get the latest version from the history table with the following code:

df4 = spark.read.option("versionAsof", 3).load(saveloc)

There are various ways to get the latest version such as:

df5 = spark.read.load("/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1@v3")

Or

df6 = df5 = spark.read.load(saveloc+"@v3")

Or in SQL it would be something similar to:

SELECT * FROM saveloc@v3

Can someone let me know if it's possible to write a WHERE clause on the version field e.g

Select * From saveloc
where version > 2
Alex Ott
  • 80,552
  • 8
  • 87
  • 132
Patterson
  • 1,927
  • 1
  • 19
  • 56

2 Answers2

1

This is not possible. Let's imagine you have a table with 5 versions. If you use a query like

Select * From saveloc
where version > 2

which version would you expect to see, 3, 4 or 5? You need to specify a version.

Robert Kossendey
  • 6,733
  • 2
  • 12
  • 42
  • Hi @Robert, thanks for reaching out. The ultimate goal is to compare the latest version with a version just below the latest version e.g I would want to compare, say version 5 with version 4. But I don't want to have to actually specify version 5. Does that make sense? – Patterson Mar 18 '23 at 17:41
  • You can run a select * from deltatable union select * from deltatable version as of 4 as when you do not specify the version you get the latest version. – Denny Lee Mar 18 '23 at 18:08
0
select * from deltatable version as of 9
John Stud
  • 1,506
  • 23
  • 46
  • Hi @John Stud, thanks for reaching out. Is there where of select the latest version without actually having to write ```select * from deltatable version as of 9``` – Patterson Mar 18 '23 at 15:51
  • Hi, I meant to say is there a WAY of selecting the version number without specifically specifying the version number .. e.g ```where version as of > 5``` – Patterson Mar 18 '23 at 16:13
  • If it's not possible to select the version number with a WHERE clause, can someone let me know if its possible to select the TIMESTAMP of the History table without actually specifying the actual TIMESTAMP e.g ```where timestamp between date1 and date2``` – Patterson Mar 18 '23 at 16:27