
In the 'Data' tab of Databricks I see that the number of files used by the Delta table is 20,000 (size: 1.6 TB). But the actual file count on the Azure Blob Storage location where Delta stores its files is 13.5 million (size: 31 TB).

The following checks have already been done (a sketch for cross-checking the counts follows the list):

  • VACUUM runs every day with the default 7-day retention interval (each run takes approximately 4 hours).
  • the transaction logs cover the last 30 days
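
A minimal sketch of how those numbers can be cross-checked in a notebook, assuming `spark` is the notebook's session; the table path is a placeholder:

```python
# File count and total size as recorded in the Delta transaction log;
# this should match the 20,000 files / 1.6 TB shown in the Data tab.
spark.sql("DESCRIBE DETAIL delta.`/mnt/datalake/my_table`") \
    .select("numFiles", "sizeInBytes").show()

# Operation history: frequent updates/merges/overwrites rewrite files,
# leaving the superseded versions behind until VACUUM removes them.
spark.sql("DESCRIBE HISTORY delta.`/mnt/datalake/my_table`") \
    .select("version", "timestamp", "operation", "operationMetrics") \
    .show(truncate=False)
```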

Questions:

  • What are these extra files that exist beyond the ones referenced by the Delta table?
  • We would like to delete these extra files and free up the storage space. How can we isolate the files that are used by the Delta table? Is there a command to list them? (One possible approach is sketched below the list.)
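
A minimal sketch of one approach, assuming a Databricks notebook where `spark` is available; the table path and output location are placeholders. `input_file_name()` returns the underlying data file for each row, so a distinct projection over it yields the set of files the current table version actually references:

```python
from pyspark.sql.functions import input_file_name

# Placeholder path: substitute the table's actual storage location.
table_path = "abfss://container@account.dfs.core.windows.net/path/to/table"

# Collect the distinct set of data files referenced by the current table version.
referenced_files = (
    spark.read.format("delta").load(table_path)
    .select(input_file_name().alias("file_path"))
    .distinct()
)

# Persist the list so it can be diffed against a full listing of the storage account.
referenced_files.write.mode("overwrite").csv("/tmp/referenced_files")
```

Note that this scans the full table (1.6 TB here), so it is expensive; anything in storage that is absent from this list and older than the retention period is a candidate for cleanup.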

Note: I am using Azure Databricks and am currently trying out the VACUUM DRY RUN command to see if it helps (will update soon); a sketch of it follows.
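
A sketch with a placeholder path; on Databricks, `DRY RUN` makes `VACUUM` return a sample of the files it would delete instead of deleting them:

```python
# Preview which files VACUUM would remove, without deleting anything.
# RETAIN 168 HOURS matches the default 7-day retention period.
spark.sql(
    "VACUUM delta.`/mnt/datalake/my_table` RETAIN 168 HOURS DRY RUN"
).show(truncate=False)
```

Thanks in advance.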

SriramN
  • look into the history; most probably you're updating/replacing the table very often, so the data is stored there until it is deleted by vacuum – Alex Ott Mar 20 '21 at 13:58
  • @AlexOtt, we are running vacuum with the default 7 days and it runs daily. Yes, we are updating/replacing files very often in a streaming way, but 20K vs. 13.5M files still seems wrong. – SriramN Mar 22 '21 at 11:07
