I just removed a DVC tracking file by mistake with the command dvc remove training_data.dvc -p, and my entire training dataset is now gone. I know that in Git we can easily restore a deleted branch from its hash. Does anyone know how to recover my lost data in DVC?

nguyendhn

1 Answer

Most likely you are safe (at least the data is not gone). From the dvc remove docs:

Note that it does not remove files from the DVC cache or remote storage (see dvc gc). However, remember to run dvc push to save the files you actually want to use or share in the future.

So, if you created training_data.dvc with dvc add and/or dvc run, and dvc remove -p didn't ask or warn you about anything, it means the data is still cached in .dvc/cache, similar to how Git keeps its objects.

There are ways to retrieve it, but I would need to know a few more details: how exactly did you add your dataset? Did you commit training_data.dvc, or is it completely gone? Was it the only data you have added so far? (Happy to help in the comments.)

Recovering a directory

First of all, here is the document that briefly describes how DVC stores directories in the cache.

What we can do is find all .dir files in .dvc/cache:

find .dvc/cache -type f -name "*.dir"

which outputs something like:

.dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir
.dvc/cache/00/db872eebe1c914dd13617616bb8586.dir
.dvc/cache/2d/1764cb0fc973f68f31f5ff90ee0883.dir

(If the local cache is lost and we are restoring data from remote storage, the same logic applies, but the commands, e.g. to find files with the .dir extension on S3, look different.)

Each .dir file is a JSON file describing the content of one version of a directory (file names, hashes, etc.). It has all the information needed to restore that directory. The next thing we need to do is figure out which one we need. There is no single rule for that; here is what I would recommend checking (pick depending on your use case):

  • Check the date modified (if you remember when this data was added).
  • Check the content of those files: if you remember a specific file name that was present only in the directory you are looking for, just grep for it.
  • Try to restore them one by one and check the directory content.
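The first two checks above can be combined in a small script. This is only a sketch, not part of DVC itself: the function name find_dir_manifests is made up for illustration, and it assumes each .dir file is a JSON list of objects with a "relpath" key, as described in this answer. It lists the candidates newest first and optionally keeps only those that mention a given file name.

```python
import json
import time
from pathlib import Path


def find_dir_manifests(cache_root, wanted_name=None):
    """Return (path, mtime, relpaths) for each .dir file, newest first.

    Hypothetical helper: assumes every .dir file holds a JSON list of
    {"md5": ..., "relpath": ...} objects.
    """
    results = []
    for manifest in Path(cache_root).rglob("*.dir"):
        if not manifest.is_file():
            continue
        try:
            entries = json.loads(manifest.read_text())
        except (OSError, ValueError):
            continue  # skip unreadable or non-JSON files
        if not isinstance(entries, list):
            continue
        relpaths = [e.get("relpath", "") for e in entries if isinstance(e, dict)]
        # Keep only manifests that mention the file name we remember, if given.
        if wanted_name is not None and not any(wanted_name in p for p in relpaths):
            continue
        results.append((manifest, manifest.stat().st_mtime, relpaths))
    results.sort(key=lambda item: item[1], reverse=True)
    return results


if __name__ == "__main__":
    for path, mtime, relpaths in find_dir_manifests(".dvc/cache", "train.tsv"):
        stamp = time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime))
        print(stamp, path, "({} files)".format(len(relpaths)))
```

Run it from the repository root; dropping the second argument lists every .dir file in the cache with its modification time.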

Okay, now let's imagine we decided to restore .dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir (e.g. because its content looks like:

[
{"md5": "6f597d341ceb7d8fbbe88859a892ef81", "relpath": "test.tsv"},
{"md5": "32b715ef0d71ff4c9e61f55b09c15e75", "relpath": "train.tsv"}
]

and we want to get a directory with train.tsv).

The only thing we need to do is to create a .dvc file that references this directory:

outs:
- md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
  path: my-directory

(Note that the path /20/b786b6e6f80e2b3fcf17827ad18597.dir became the hash value 20b786b6e6f80e2b3fcf17827ad18597.dir.)

Then run dvc pull on this file.

That should be it.
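If there are several candidates to try, writing the .dvc file by hand gets tedious, so here is a minimal sketch that generates it. The helper names and the my-directory.dvc file name are made up for illustration; the only assumption about the cache layout is that the hash is the two-character parent folder name joined with the file name, as noted above.

```python
from pathlib import Path


def cache_path_to_hash(dir_file):
    """Join the two-character parent folder name with the file name."""
    p = Path(dir_file)
    return p.parent.name + p.name


def write_dvc_file(dir_file, out_path, dvc_file):
    """Write a minimal .dvc file that references the cached directory."""
    content = "outs:\n- md5: {}\n  path: {}\n".format(
        cache_path_to_hash(dir_file), out_path
    )
    Path(dvc_file).write_text(content)
    return content


if __name__ == "__main__":
    # Hypothetical output names; adjust them to your project.
    print(write_dvc_file(
        ".dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir",
        "my-directory",
        "my-directory.dvc",
    ))
```

After that, dvc pull my-directory.dvc should bring the directory back, the same as with the hand-written file.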

Shcheklein
  • Thank you for your comment. Actually, I only heard about DVC and started trying it recently, so I believe there are many things I don't know or realize yet. For now, I haven't created any pipeline, so I don't use `dvc run`. The process I use is: prepare data for training (manually, with a Python script) -> train -> `dvc add` the training data and h5 models -> `dvc push` to remote storage. If something changes in the data preparation (e.g., reducing the size of images), I have to redo those steps again! I think it's an inefficient approach. I had forgotten about `dvc commit`! – nguyendhn Jun 18 '20 at 15:51
  • As I remember, I had added a folder containing images (including labels) named training_data (`dvc add`), then ran `dvc push`. After that, I replaced that folder with another one (totally different subfolders and images, but still the same parent folder) and reran `dvc add`. I encountered an error (I can't remember exactly what), then used `dvc remove -p`, thinking that it just deleted the DVC tracking file ... – nguyendhn Jun 18 '20 at 15:51
  • @nguyendhn updated the answer; it now includes brief instructions on how to recover directories. Please give it a try and let me know if you hit any bumps. – Shcheklein Jun 18 '20 at 16:49
  • Thank you so much for your clear explanation. I understand the process you suggested. Although I cannot find the `.dir` file I need anymore (maybe I did something else silly that I cannot remember), I reproduced the scenario with a few samples, applied your approach, and it worked! I appreciate your support. :) I would be really happy if there were a more efficient way to deal with this issue in the future. For now, I will continue to explore other functionality in DVC. ;) – nguyendhn Jun 19 '20 at 05:02