How to fetch the latest version number of a delta table

Question

for one of my use cases I am using change data feed (CDF) feature of delta lake,it goes all well with CDF but when i read all the data to insert in gold, it lists all the versions, is there a way i can read only the latest version without specifying version number or a way to fetch latest version ?

        return spark.read.format("delta") \
                  .option("readChangeFeed", "true") \
                  .table(tableName) \
                  .where(col("_change_type") != "preimage")

above code block returns results from all versions since start, i can fetch only latest data by looking into the table and specifying the version but i don't understand how to enable this in production, I don't want to use timestamp to fetch the latest version as in case of retries some one might run the pipeline multiple times a day and this will bring data inaccuracies if not handled as 1st run of the day. Any help would be appreciated.

score 1 · Answer 1 · edited Aug 24 '22 at 09:26

1

We can write a query for row level modifications to get the different versions of a delta table.

As Tim posted in an answer to a similar Stack Overflow question, you can read it as a stream like the following:

spark.readStream
     .format("delta")
     .option("readChangeFeed", "true")
     .option("startingVersion", "latest")
     .table(tableName) 
     .filter("_change_type != 'update_preimage'")

edited Aug 24 '22 at 09:26

Ryan M

18,333
31
67
74

answered Jun 06 '22 at 11:43

RajkumarPalnati

541
2
6

eventually the "latest" tag isn't working for me, so i have used the history table to fetch the version specific to the day. Code snippet in another comment. – newbee123 Jun 09 '22 at 21:26
` date_dict = self.calc_day(minus_n_days) process_date = date_dict['process_date'] full_history_df = DeltaTable.forName(self.spark, table_name).history() latest_version_row = full_history_df.filter(to_date(col("timestamp")) == date_dict['process_date']).selectExpr("max(version)").collect() latest_version = latest_version_row[0][0]` – newbee123 Jun 09 '22 at 21:30
Hope you enabled the CDF feature i.e, **TBLPROPERTIES & table_changes()** as per the [recommendations](https://stackoverflow.com/a/70227670/18935775) – RajkumarPalnati Jun 10 '22 at 04:32
yes, it's working correctly now. – newbee123 Jun 10 '22 at 12:34

How to fetch the latest version number of a delta table

1 Answers1

Linked