I am trying to query the Delta Lake table history as described at the following link
https://learn.microsoft.com/en-us/azure/databricks/delta/history
When I describe the delta table as follows
describe history '/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1'
I get the below table output
version | timestamp | userId | userName | operation | operationParameters | job | notebook | clusterId | readVersion | isolationLevel | isBlindAppend | operationMetrics | userMetadata | engineInfo |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
"2 | 18/03/2023 12:25:54.0000000 | 615257000000000 | carlton@olvin.com | MERGE | {""predicate"":""(s.primary_key_hash = t.primary_key_hash)"",""matchedPredicates"":""[{""predicate"":""(NOT (s.change_key_hash = t.change_key_hash))"",""actionType"":""update""}]"",""notMatchedPredicates"":""[{""actionType"":""insert""}]"",""notMatchedBySourcePredicates"":""[{""actionType"":""delete""}]""} | (null) | {""notebookId"":""3807690121522291""} | 0318-105603-oyrrx3xc | 1 | WriteSerializable | False | {""numTargetRowsCopied"":""0"",""numTargetRowsDeleted"":""1"",""numTargetFilesAdded"":""1"",""numTargetBytesAdded"":""9070"",""numTargetBytesRemoved"":""9176"",""numTargetDeletionVectorsAdded"":""0"",""numTargetRowsMatchedUpdated"":""27"",""executionTimeMs"":""13999"",""numTargetRowsInserted"":""0"",""numTargetRowsMatchedDeleted"":""0"",""scanTimeMs"":""4276"",""numTargetRowsUpdated"":""27"",""numOutputRows"":""27"",""numTargetDeletionVectorsRemoved"":""0"",""numTargetRowsNotMatchedBySourceUpdated"":""0"",""numTargetChangeFilesAdded"":""0"",""numSourceRows"":""27"",""numTargetFilesRemoved"":""1"",""numTargetRowsNotMatchedBySourceDeleted"":""1"",""rewriteTimeMs"":""9012""} | (null) | Databricks-Runtime/12.2.x-scala2.12" |
"1 | 18/03/2023 12:14:43.0000000 | 615257000000000 | carlton@olvin.com | MERGE | {""predicate"":""(s.primary_key_hash = t.primary_key_hash)"",""matchedPredicates"":""[{""predicate"":""(NOT (s.change_key_hash = t.change_key_hash))"",""actionType"":""update""}]"",""notMatchedPredicates"":""[{""actionType"":""insert""}]"",""notMatchedBySourcePredicates"":""[{""actionType"":""delete""}]""} | (null) | {""notebookId"":""3807690121522291""} | 0318-105603-oyrrx3xc | 0 | WriteSerializable | False | {""numTargetRowsCopied"":""0"",""numTargetRowsDeleted"":""0"",""numTargetFilesAdded"":""1"",""numTargetBytesAdded"":""9176"",""numTargetBytesRemoved"":""0"",""numTargetDeletionVectorsAdded"":""0"",""numTargetRowsMatchedUpdated"":""0"",""executionTimeMs"":""6222"",""numTargetRowsInserted"":""28"",""numTargetRowsMatchedDeleted"":""0"",""scanTimeMs"":""2280"",""numTargetRowsUpdated"":""0"",""numOutputRows"":""28"",""numTargetDeletionVectorsRemoved"":""0"",""numTargetRowsNotMatchedBySourceUpdated"":""0"",""numTargetChangeFilesAdded"":""0"",""numSourceRows"":""28"",""numTargetFilesRemoved"":""0"",""numTargetRowsNotMatchedBySourceDeleted"":""0"",""rewriteTimeMs"":""3593""} | (null) | Databricks-Runtime/12.2.x-scala2.12" |
"0 | 18/03/2023 12:14:23.0000000 | 615257000000000 | carlton@olvin.com | CREATE OR REPLACE TABLE | {""isManaged"":""false"",""description"":null,""partitionBy"":""[]"",""properties"":""{}""} | (null) | {""notebookId"":""3807690121522291""} | 0318-105603-oyrrx3xc | (null) | WriteSerializable | True | {} | (null) | Databricks-Runtime/12.2.x-scala2.12" |
I have assigned the path with the following variable
saveloc = '/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1'
As you can see from the History Output above there is field called version and operationParameters
Its easy to get the latest version from the history table with the following code:
df4 = spark.read.option("versionAsof", 3).load(saveloc)
There are various ways to get the latest version such as:
df5 = spark.read.load("/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1@v3")
Or
df6 = df5 = spark.read.load(saveloc+"@v3")
Or in SQL it would be something similar to:
SELECT * FROM saveloc@v3
Can someone let me know if it's possible to write a WHERE clause on the version field e.g
Select * From saveloc
where version > 2