In the Data tab of Databricks, I see that the number of files used by the Delta table is 20,000 (size: 1.6 TB). However, the actual file count on the Azure Blob Storage account where Delta stores its files is 13.5 million (size: 31 TB).
I have already done the following checks:
- VACUUM runs every day with the default 7-day retention interval (it takes approximately 4 hours each run).
- the transaction logs cover the last 30 days (I verified the retention settings as sketched after this list).
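For reference, this is how I checked the retention properties. A minimal sketch for a Databricks notebook (where `spark` is predefined); `my_db.my_table` is a placeholder for the actual table name:

```python
# Sketch, assuming the table is registered as my_db.my_table (placeholder).
# SHOW TBLPROPERTIES lists explicitly set properties; if the retention keys
# are absent, the defaults apply (7 days for deletedFileRetentionDuration,
# 30 days for logRetentionDuration).
props = spark.sql("SHOW TBLPROPERTIES my_db.my_table").collect()
for row in props:
    if "RetentionDuration" in row.key:
        print(row.key, "=", row.value)
```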
Questions:
- What are these extra files that exist over and above what the Delta table uses?
- We would like to delete these extra files and free up storage space. How can we isolate the files that are actually referenced by the Delta table? Is there a command to list them? (The closest I have found so far is sketched below.)
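The closest I have found is listing the data files backing the current snapshot with PySpark's `inputFiles()`. A minimal sketch, assuming a placeholder table path:

```python
# Sketch: list the data files referenced by the current snapshot of the
# Delta table. The path below is a placeholder for the real location.
table_path = "abfss://container@account.dfs.core.windows.net/path/to/table"

df = spark.read.format("delta").load(table_path)

# inputFiles() returns a best-effort list of the data files backing the
# current snapshot. Files in storage that are neither in this list nor
# under _delta_log/ would be the "extra" candidates.
active_files = set(df.inputFiles())
print(f"{len(active_files)} files referenced by the current snapshot")
```

I am not sure whether this is the recommended way to get the full referenced-file list, which is part of my question.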
Note: I am using Azure Databricks and am currently trying out the VACUUM DRY RUN command to see whether it helps (I will update soon).
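For completeness, this is the dry run I am testing (table name is a placeholder; 168 hours matches the default 7-day retention):

```python
# DRY RUN does not delete anything; it returns (a sample of) the files
# that a real VACUUM with the same retention threshold would remove.
result = spark.sql("VACUUM my_db.my_table RETAIN 168 HOURS DRY RUN")
result.show(truncate=False)
```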
Thanks in advance.