1

for one of my use cases I am using change data feed (CDF) feature of delta lake,it goes all well with CDF but when i read all the data to insert in gold, it lists all the versions, is there a way i can read only the latest version without specifying version number or a way to fetch latest version ?

        return spark.read.format("delta") \
                  .option("readChangeFeed", "true") \
                  .table(tableName) \
                  .where(col("_change_type") != "preimage")

above code block returns results from all versions since start, i can fetch only latest data by looking into the table and specifying the version but i don't understand how to enable this in production, I don't want to use timestamp to fetch the latest version as in case of retries some one might run the pipeline multiple times a day and this will bring data inaccuracies if not handled as 1st run of the day. Any help would be appreciated.

ray
  • 11,310
  • 7
  • 18
  • 42
newbee123
  • 21
  • 1
  • 2

1 Answers1

1

We can write a query for row level modifications to get the different versions of a delta table.

As Tim posted in an answer to a similar Stack Overflow question, you can read it as a stream like the following:

spark.readStream
     .format("delta")
     .option("readChangeFeed", "true")
     .option("startingVersion", "latest")
     .table(tableName) 
     .filter("_change_type != 'update_preimage'")
Ryan M
  • 18,333
  • 31
  • 67
  • 74
RajkumarPalnati
  • 541
  • 2
  • 6
  • eventually the "latest" tag isn't working for me, so i have used the history table to fetch the version specific to the day. Code snippet in another comment. – newbee123 Jun 09 '22 at 21:26
  • ` date_dict = self.calc_day(minus_n_days) process_date = date_dict['process_date'] full_history_df = DeltaTable.forName(self.spark, table_name).history() latest_version_row = full_history_df.filter(to_date(col("timestamp")) == date_dict['process_date']).selectExpr("max(version)").collect() latest_version = latest_version_row[0][0]` – newbee123 Jun 09 '22 at 21:30
  • Hope you enabled the CDF feature i.e, **TBLPROPERTIES & table_changes()** as per the [recommendations](https://stackoverflow.com/a/70227670/18935775) – RajkumarPalnati Jun 10 '22 at 04:32
  • yes, it's working correctly now. – newbee123 Jun 10 '22 at 12:34