
I have a Delta table with 4 versions.

DESCRIBE HISTORY cfm shows 4 versions: 0, 1, 2, 3.

I want to delete version 3 or 2. How can I achieve this?

I tried:

from delta.tables import *
from pyspark.sql.functions import *

deltaTable = DeltaTable.forPath(spark, "path of cfm files")

deltaTable.delete("'version' = '3'") 

This does not delete version 3. The docs at https://docs.delta.io/0.4.0/delta-update.html say:

"delete removes the data from the latest version of the Delta table but does not remove it from the physical storage until the old versions are explicitly vacuumed"

If I have to run the vacuum command, how do I use it on the latest versions and not the older ones?

nl09
1 Answer


You need to use the vacuum command to perform this operation. However, the default retention threshold for vacuum is 7 days, and it will error out if you try to vacuum with a retention period shorter than that.

We can work around this by setting a Spark configuration that bypasses the default retention period check.

Solution below:

from delta.tables import *

# bypass the 7-day minimum retention safety check
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

deltaTable = DeltaTable.forPath(spark, deltaPath)
deltaTable.vacuum(24)  # delete files older than 24 hours

*deltaPath -- the path to your Delta table

*24 -- the retention period in hours; any versions created more than 24 hours in the past will have their data files deleted.
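
Depending on your Spark and Delta versions, the SQL form of the command also supports a dry run that only lists the files that would be deleted, which is a useful safety check before actually removing anything. A quick sketch (the table path is a placeholder):

# list, but do not delete, the files a 24-hour vacuum would remove
spark.sql("VACUUM delta.`/path/of/cfm/files` RETAIN 24 HOURS DRY RUN").show(truncate=False)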

Hope this answers your question.

Solomon
  • This will delete any versions created before 24 versions. My question is whether we can delete only the latest version, keeping the older versions as they are. – nl09 Apr 13 '21 at 18:36
  • It will delete any versions created before 24 'hours', not versions. So if you require only the latest version to be retained, determine the interval at which each update happens to the table and set that time interval in place of 24. E.g. if your table gets updated every hour and you require only the latest version, set the number to one (see the sketch after these comments). – Solomon Apr 13 '21 at 20:03
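
A minimal sketch of that suggestion, assuming the table is updated hourly and reusing the deltaPath placeholder from the answer:

# bypass the 7-day minimum retention safety check, as in the answer
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

deltaTable = DeltaTable.forPath(spark, deltaPath)
deltaTable.vacuum(1)  # keep only files written in the last hour, i.e. the latest version

# the transaction log still lists the old versions afterwards ...
deltaTable.history().select("version", "timestamp").show()
# ... but time-travel reads of those versions will fail, because their data files are gone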