I would like to be able to delete individual files or folders from the DVC cache after they have been pulled with dvc pull, so that they no longer take up space on the local disk.

Let me make things more concrete and summarize the solutions I found so far. Imagine you have downloaded a data folder using something like:

dvc pull <my_data_folder.dvc>

This places the downloaded data in .dvc/cache and creates a set of soft links in my_data_folder (if you have configured DVC to use soft links):

ls -l my_data_folder

You will see something like:

my_data_file_1.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
my_data_file_2.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
...

Imagine you don't need this data for a while and want to free the disk space it uses. I know of two manual approaches for doing that, although I am not sure about the second one:

Preliminary step (optional)

Not needed if you have symlinks (which I believe is true, at least in unix-like OS):

dvc unprotect my_data_folder

Approach 1 (verified):

Delete all the cached data. From the repo's root folder:

rm -r my_data_folder
rm -rf .dvc/cache

This seems to work properly and completely frees the disk space previously used by the downloaded data. When we need the data again, we can fetch it with dvc pull as before. The drawback is that this removes all the data downloaded with DVC so far, not only the data corresponding to my_data_folder, so we would need to run dvc pull for everything again.

Approach 2 (NOT verified):

Delete only specific files (this still needs thorough testing to confirm that it does not corrupt DVC in any way):

First, take note of the path indicated in the soft link:

ls -l my_data_folder

You will see something like:

my_data_file_1.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
my_data_file_2.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792

If you want to delete my_data_file_1.pk, from the repo's root folder run:

rm .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
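
The same idea can be extended to a whole folder. The following is only a minimal sketch, and it carries the same caveat as Approach 2: it has not been verified against DVC's internals. The function name remove_cached_targets is hypothetical (it is not a DVC command), and the sketch assumes that DVC is configured to use soft links and that readlink -f is available (GNU coreutils or a recent macOS).

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): for every symlink under a data
# folder, delete the cache object it points to, then delete the link itself.
# Assumes DVC is configured with soft links, as in the listing above.
remove_cached_targets() {
    data_dir=$1
    cache_dir=$2   # e.g. .dvc/cache
    # Resolve the cache directory to its physical path for a safe comparison.
    cache_abs=$(cd "$cache_dir" && pwd -P) || return 1
    find "$data_dir" -type l | while IFS= read -r link; do
        target=$(readlink -f "$link") || continue
        case $target in
            "$cache_abs"/*)
                # Only touch targets that really live inside the cache.
                rm -f "$target" "$link"
                ;;
        esac
    done
}
```

From the repo's root folder it would be called as remove_cached_targets my_data_folder .dvc/cache. Links that point outside the cache are left alone.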

Note on dvc gc

For some reason, running dvc gc does not seem to delete the files from the cache, at least in my case.

I would appreciate it if someone could suggest a better way, or comment on whether the second approach is actually appropriate. Also, if I want to delete the whole folder rather than go file by file, is there any way to do that automatically?

Thank you!

Jau A

1 Answer

At the moment it's not possible to specify an individual directory or file to be removed from the cache. Here are the tickets to vote on and ask to be prioritized:

"For some reason, running dvc gc does not seem to delete the files from the cache, at least in my case."

This is a bit concerning. If you run it with the -w option, it keeps only the files and directories that are referenced in the current versions of the .dvc and dvc.lock files, and it should remove everything else.

So, let's say you are building a model:

my_model_file.pk

You created it once, its hash is 4f7bc7702897bec7e0fae679e968d792, and that hash is written in dvc.lock or in my_model_file.dvc.
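
To connect this hash to the cache path shown in the question, here is a small sketch. It assumes the layout seen there, where the first two characters of the hash become a subdirectory of .dvc/cache; other DVC versions may lay out the cache differently. The function name cache_path_for_hash is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map a hash to its cache path, assuming the
# <cache>/<first two chars>/<remaining chars> layout shown in the question.
cache_path_for_hash() {
    hash=$1
    cache_dir=${2:-.dvc/cache}
    rest=${hash#??}            # hash without its first two characters
    prefix=${hash%"$rest"}     # the first two characters
    printf '%s/%s/%s\n' "$cache_dir" "$prefix" "$rest"
}

cache_path_for_hash 4f7bc7702897bec7e0fae679e968d792
# prints .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
```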

Then you do another iteration, and the hash is now different: 5a8cc7702897bec7e0faf679e968d363. This new hash should now be written in the .dvc file or in dvc.lock. That means the model corresponding to the previous hash, 4f7bc7702897bec7e0fae679e968d792, is no longer referenced. In this case dvc gc -w should definitely collect it. If that is not happening, please create a ticket and we'll try to reproduce it and take a look.
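
Before filing a ticket, it can help to check whether a given hash is still mentioned anywhere in the workspace. The following is a rough sketch, not DVC's own logic: the function name is_referenced is hypothetical, and it only does a text search in dvc.lock and *.dvc files. In particular, it will not find hashes of individual files inside a tracked directory, since those are not necessarily listed in these files.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given hash appears in any dvc.lock
# or *.dvc file under the given root (default: current directory).
# This is a plain text search, not a reimplementation of dvc gc.
is_referenced() {
    hash=$1
    root=${2:-.}
    matches=$(find "$root" -type f \( -name 'dvc.lock' -o -name '*.dvc' \) \
        -exec grep -l -F -- "$hash" {} +)
    [ -n "$matches" ]
}
```

For example, `is_referenced 4f7bc7702897bec7e0fae679e968d792 || echo "not referenced"` would print the message when no such file mentions the old hash, which is the situation where dvc gc -w is expected to remove it.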

Shcheklein
Thank you for your answer. I will vote on the mentioned issues. Just one question: do you know whether approach 2 that I described is appropriate? That is, deleting specific files from .dvc/cache based on the hash indicated in the symlink, with something like `rm .dvc/cache/4f/7bc7702897bec7e0fae679e968d792`. Regarding `dvc gc`, I understand it better now thanks to your response. I had originally assumed that its purpose was to free the space used by files/folders that are no longer needed at all, including their latest version. I didn't try it for deleting previous versions. – Jau A Oct 06 '22 at 19:08