0

I recieved help from @Jacek Laskowski on how to access the latest version of a Delta table as an integer here

The code that he recommended was as follows:

from delta.tables import DeltaTable
import pyspark.sql.functions

dt = DeltaTable.forPath(spark, '/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1')
latest_version = int(dt.history().select(max(col("version"))).collect()[0][0])

Can someone help me with code to get the Earliest Version of a Delta (is it even possible?)

David Makogon
  • 69,407
  • 21
  • 141
  • 189
Patterson
  • 1,927
  • 1
  • 19
  • 56
  • 1
    All versions are listed in the history. Have you tried reading version 1? Have you tried taking `min(col("version"))` instead of `max(col("version"))`? – David Makogon May 23 '23 at 18:24
  • Hi @DavidMakogon, thanks for reaching out. I used to insert version 1, but the version numbers changes, meaning version 1 may be removed after 7 days and the new earliest version may version 10 ... I hope that makes sense. In the meantime, I will try ```min(col("version")``` – Patterson May 23 '23 at 19:11
  • Hi @DavidMakogon, ```min(col("version"))`` worked. Thank you – Patterson May 23 '23 at 19:16
  • No no, this didn't work. ```min(col("version"))``` will always give me version 0. Versrion 0 in Delta Lake is always CREATE OR REPLACE TABLE ... I need to somehow get the earliest version in the HISTORY table. Is that possible – Patterson May 23 '23 at 21:35
  • Just take a look at the table. You can enumerate it, read it, sort it, take the version you want. Lots of metadata there. – David Makogon May 23 '23 at 21:44
  • The earliest version in my Delta Lake History is Version 10. Is there anything that I can do with my code that will get my the earliest version in my Delta Lake History table i.e Version 10 – Patterson May 23 '23 at 21:44
  • 1
    At this point I suggest you try something and then edit your question to show your work and where you’re stuck. Not really an appropriate use of comments to keep posting refinements to your requirements. – David Makogon May 23 '23 at 21:45
  • Hi @DavidMakogon, I will spend time thinking about overnight, but I'm not sure it's actually possible to dynamically get the earliest version, without actually knowing what the earliest version is by doing a ```DESCRIBE table``` – Patterson May 23 '23 at 21:53
  • Spend time looking at delta lake docs, and how you can query the history, including the use of sql. You absolutely can retrieve all metadata and inspect it / use it in your code. Anyway... this comments section is already getting a warning for being too long... – David Makogon May 23 '23 at 22:07
  • What are you expecting? Earliest version other than 0 ? – JayashankarGS May 26 '23 at 06:03
  • Hi @JayashankarGS, I'm sorry it's a little confusing... of course the earliest version is 0. However, version 0 is CREATE TABLE. However, after the 7 retention the earliest Version can change from 0 to 10. Therefore, Version 10 would now become the earliest version, and it is Version 10 that I would now need. Does that make sense? – Patterson May 26 '23 at 18:31

1 Answers1

1

Try below code.

from delta.tables import *
dt = DeltaTable.forPath(spark,path = "/mnt/blob/Databricks/bronze/devices/")
deleted = 0
for row in dt.history().collect():
    if row["operation"]=="VACUUM END":
        deleted = deleted + int(row["operationMetrics"]['numDeletedFiles'])
earliest_ver = deleted + 1
print(earliest_ver)

Here, it checks number of deleted files in every vacuum operations. If you try it get data of vacuumed files you will get file not found exception, so the next possible file which can be retrieved is deleted file plus one.

Output:

enter image description here

JayashankarGS
  • 1,501
  • 2
  • 2
  • 6