
I am trying to set up a DVC repository for machine learning data with different tagged versions of the dataset. I do this with something like:

$ cd /raid/ml_data  # folder on a data drive
$ git init
$ dvc init
$ [add data]
$ [commit to dvc, git]
$ git tag -a 1.0.0
$ [add or change data]
$ [commit to dvc, git]
$ git tag -a 1.1.0

I have multiple projects that each need to reference some version of this dataset. The problem is I can't figure out how to set up those projects to reference a specific version. I'm able to track the HEAD of the repo with something like:

$ cd ~/my_proj  # different drive than the remote
$ mkdir data
$ git init
$ dvc init
$ dvc remote add -d local /raid/ml_data  # add the remote on my data drive
$ dvc cache dir /raid/ml_data/.dvc/cache  # tell DVC to use the remote cache
$ dvc checkout
$ dvc run --external -d /raid/ml_data -o data/ cp -r /raid/ml_data data

This gets me the latest version of the dataset, symlinked into my data folder, but what if I want some projects to use the 1.0.0 version and some to use the 1.1.0 version, or another version? Or for that matter, if I update the dataset to 2.0.0 but don't want my existing projects to necessarily track HEAD and instead keep the version with which they were set up?

It's important to me to not create a ton of local copies of my dataset as the /home drive is much smaller than the /raid drive and some of these datasets are huge.

desertnaut
Engineero

1 Answer


I think you are looking for the data access set of commands.

In your particular case, dvc import makes sense:

$ dvc import /raid/ml_data data

if you want to get the most recent version (HEAD). Then you will be able to update it with the dvc update command (if 2.0.0 is released, for example).

$ dvc import /raid/ml_data data --rev 1.0.0

if you'd like to "fix" it to the specific version.
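The two commands above can be wrapped into small helpers so that each project records which tag it uses. This is only a sketch: the function names `import_pinned` and `repin` are hypothetical, not part of DVC, and it assumes that `dvc import` writes a `data.dvc` file and that `dvc update` accepts `--rev` to move an existing import to another tag.

```shell
# Hypothetical helpers; the names are illustrative and not DVC commands.

# Import <path> from the DVC repo at <repo>, pinned to git revision <rev>.
# Arguments: $1 = repo location, $2 = path inside that repo, $3 = tag or commit.
import_pinned() {
    dvc import "$1" "$2" --rev "$3"
}

# Move an existing import to a different revision.
# Arguments: $1 = the .dvc file created by the import, $2 = new tag or commit.
repin() {
    dvc update --rev "$2" "$1"
}
```

For example, `import_pinned /raid/ml_data data 1.0.0` pins a project to 1.0.0, and `repin data.dvc 1.1.0` would later move that same project to 1.1.0, while other projects stay on whatever revision they imported.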

Avoiding copies

Also make sure that symlinks are enabled for the second project, as described in the Large Dataset Optimization guide:

$ dvc config cache.type reflink,hardlink,symlink,copy

(The config modifiers --global, --local, and --system let you apply this setting for everyone at once, for just one project, and so on.)

See the detailed instructions here.
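One way to check whether the workspace really holds links rather than copies is to count symlinks under the data folder. The sketch below uses only `find` and `wc`; `count_links` is a hypothetical helper name, not a DVC command. Hard links and reflinks would show up as regular files here, but neither can span two different drives, so for a /home project backed by a /raid cache a non-zero count of regular files most likely means copies.

```shell
# Count symlinks versus regular files under a directory.
# count_links is a hypothetical helper, not part of DVC.
count_links() {
    dir="$1"
    # find does not follow symlinks by default, so "-type f" skips them.
    links=$(find "$dir" -type l | wc -l | tr -d ' ')
    files=$(find "$dir" -type f | wc -l | tr -d ' ')
    echo "symlinks=$links regular_files=$files"
}
```

Running `count_links data` in the project should then report mostly symlinks. If the cache type was changed after the data was already checked out, `dvc checkout --relink` is the documented way to recreate the links with the new setting.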


Overall, it's a great setup, and it looks like you got pretty much everything right. Please don't hesitate to follow up or post other questions here; we'll help you with this.

Shcheklein
  • That's great, thank you. How do I make sure that it uses symlinks for the project in the home directory? It seems like if I just use your first command I create a local copy that's linked to the "remote" on the raid drives. – Engineero Nov 02 '20 at 21:26
  • @Engineero please take a look, I added some more information and some links to the relevant docs. – Shcheklein Nov 03 '20 at 00:16