I used the following to write a Delta table to Google Cloud Storage:

df.write.format("delta").partitionBy("g","p").option("delta.enableChangeDataFeed", "true").mode("append").save(path)

I then inserted data in versions 1, 2, 3, and 4, and deleted some of the data in version 5.

Then I ran:

deltaTable.vacuum(8)

I then tried to read the change feed starting from version 3:

spark.read.format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", 3)
  .load(path)

Caused by: java.io.FileNotFoundException: File not found: gs://xxx/yyy.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

I deleted the cluster and tried the read again, with the same result. Why is it looking for the vacuumed files?

I expected to see all the data inserted from version 3 onward.

Bindu

2 Answers

Please note that running deltaTable.vacuum(8) removes files that are more than 8 hours old and no longer referenced by the current version of the table. Even if you had deleted version 5 of the table, if the files for version 3 are older than 8 hours, then the only files still available are those of the most current version (in this case version 4).
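
To check which files a vacuum with a given retention would remove before anything is actually deleted, a dry run can help. The following is a minimal Scala sketch, not a definitive procedure: it assumes an existing SparkSession with Delta Lake configured, and the helper name previewVacuum is hypothetical. Delta rejects a retention below the default 168 hours unless spark.databricks.delta.retentionDurationCheck.enabled is set to false, so the sketch disables that check.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: reports the files VACUUM would delete, without deleting them.
// Assumes `spark` has Delta Lake configured and `path` points at the Delta table.
def previewVacuum(spark: SparkSession, path: String, retentionHours: Int): DataFrame = {
  // Retention below the default 168 hours is refused unless this safety check is off.
  spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

  // DRY RUN only lists candidate files; nothing is removed from storage.
  spark.sql(s"VACUUM delta.`$path` RETAIN $retentionHours HOURS DRY RUN")
}
```

Calling previewVacuum(spark, path, 8).show(false) prints the candidates. Any listed file that an older version still depends on would no longer be readable once a real vacuum runs.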

Denny Lee
  • Hi Denny, all the data in versions 1-5 is older than 8 hours at the time of vacuum. Note that spark.read.format("delta").load(path) loads all the data from versions 1-4 minus the deleted data. Something is wrong with CDC (change data capture) in Delta Lake (2.1.0) – Bindu Dec 09 '22 at 14:14
  • Okay, a bit of a disconnect here - if all the data files in version 1-5 are older than 8h, then all of the files, except for the most current table, would have been removed when you run the vacuum. This would be the reason why CDC would not work. – Denny Lee Dec 10 '22 at 17:11
  • I updated my description. I only deleted some of the data (the most recent data) in version 5. Sorry about the confusion. – Bindu Dec 11 '22 at 20:47

Setting spark.sql.files.ignoreMissingFiles to true worked!
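
Roughly, the setting fits into the change feed read from the question as sketched below. This assumes an existing SparkSession with Delta Lake configured; the helper name readChangesIgnoringMissingFiles is hypothetical. Keep in mind that with this option Spark skips files it cannot find instead of failing, so any changes stored in vacuumed files are silently left out of the result.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper wrapping the change feed read from the question.
// Assumes `path` points at a Delta table with change data feed enabled.
def readChangesIgnoringMissingFiles(
    spark: SparkSession,
    path: String,
    startingVersion: Long): DataFrame = {
  // Skip files that no longer exist (for example, vacuumed ones) instead of
  // throwing FileNotFoundException. Rows from those files are not returned.
  spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

  spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", startingVersion)
    .load(path)
}
```

For the case in the question this would be called as readChangesIgnoringMissingFiles(spark, path, 3L).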

Bindu