I am running the OPTIMIZE command for compaction, and now I want to delete the old files left behind after compaction. However, if I use the VACUUM command, I lose the ability to time travel. What is a better way to delete the old files left over from compaction without losing the ability to time travel?
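For reference, the compaction-and-cleanup flow being described looks roughly like the sketch below. This is a minimal illustration using the delta-spark Python API (2.0+, where `DeltaTable.optimize()` is available); the table path and retention window are placeholders, and `spark` is assumed to be an existing SparkSession with the Delta extensions configured.

```python
from delta.tables import DeltaTable

# Placeholder table path; `spark` is an existing SparkSession.
table = DeltaTable.forPath(spark, "/mnt/delta/events")

# OPTIMIZE rewrites many small files into fewer large ones; the old files stay
# on storage, which is what keeps time travel to older versions working.
table.optimize().executeCompaction()

# The cleanup step that causes the problem: removing files older than the
# retention window (168 hours = 7 days) also removes the files that older
# table versions need for time travel.
table.vacuum(168)
```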
1 Answer
It depends on what you are trying to achieve. Time travel is really meant for shorter-term debugging as opposed to long-term storage per se. If you would like to keep the data around for the long term, perhaps make use of Delta CLONE, per Attack of the Delta Clones (Against Disaster Recovery Availability Complexity).
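To make the CLONE suggestion concrete, here is a minimal sketch of snapshotting a specific table version into its own table so that the snapshot survives later VACUUM runs on the source table. It assumes a runtime that supports Delta CLONE (e.g. Databricks, or a recent OSS Delta release for shallow clones); the table names and version number are placeholders, not values from this thread.

```python
# Hypothetical table names and version; `spark` is an existing SparkSession.
# DEEP CLONE copies the data files of the chosen version into the new table,
# so vacuuming the source table later does not break this snapshot.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_snapshot_v42
    DEEP CLONE events VERSION AS OF 42
""")
```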

– Denny Lee
- Thanks. Actually, I have streaming data, so a bunch of operations are being performed on it, and I want to leverage time travel for data replay. But I'm running the OPTIMIZE command for compaction, so I have redundant data, i.e., small uncompacted files and large compacted files containing the same data. To avoid this redundancy, I'm running the VACUUM command to delete the old files. But when I run VACUUM, old data left by other operations is also deleted, so I lose the ability to time travel. Is there any way to delete the old data left after compaction without losing time travel? – Priyanshu May 02 '21 at 04:13
- Right now there is not a way to do this, though perhaps suggest it on the delta.io GitHub issues (or even provide a PR?). When running VACUUM, you're going to remove all old files. Saying this, note that after the OPTIMIZE command (and compaction), any new queries will use the new files, not the old files. So in your scenario, yes, there will be old files sitting there for a while, but whenever you run your VACUUM job (e.g. 7 days later), those files will be removed. – Denny Lee May 02 '21 at 16:59
- Yeah, that makes sense, but the problem is that when I run VACUUM, the old Delta table state (the state before other operations like updates) is also deleted, so I lose the ability to time travel to previous versions. I want to keep the previous versions of the table and also delete the redundant data left by compaction. (Compaction is basically just a re-arrangement of the data, so keeping the old data after compaction is a huge storage overhead.) – Priyanshu May 03 '21 at 11:31
- When you say it's a huge storage overhead: if there are no queries hitting those files, then the key issue is really the storage cost of temporarily holding this data. Are the data volumes large enough that this is a significant concern? But saying this, I completely grok you, hence the suggestion to create a GitHub issue so we (the Delta community) can prioritize it :) – Denny Lee May 04 '21 at 01:44
- Yeah, the data volume is large enough, maybe TBs of data every day, which is why the data redundancy is a problem. Anyway, thanks a lot. – Priyanshu May 04 '21 at 04:50
- Ah, that's fair enough. Then sure, please do create a GitHub issue :) – Denny Lee May 04 '21 at 18:35
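For the data-replay use case discussed in the comments above, reading an earlier version of the table looks roughly like the sketch below. It is a minimal illustration with a placeholder path and version number, and it only works while the chosen version has not yet been vacuumed away.

```python
# Read an earlier version of the table for replay (placeholder path/version).
# This only works while the data files backing that version still exist,
# i.e. before a VACUUM run has removed them.
old_df = (
    spark.read.format("delta")
    .option("versionAsOf", 5)   # or .option("timestampAsOf", "2021-05-01")
    .load("/mnt/delta/events")
)
old_df.show()
```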