I am trying to set up a DVC repository for machine learning data with different tagged versions of the dataset. I do this with something like:
$ cd /raid/ml_data # folder on a data drive
$ git init
$ dvc init
$ [add data]
$ [commit to dvc, git]
$ git tag -a 1.0.0
$ [add or change data]
$ [commit to dvc, git]
$ git tag -a 1.1.0
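To make concrete what each tag pins, here is a minimal sketch that runs with plain git in a throwaway directory. It assumes a hand-written stand-in file named data.dvc (hypothetical, not real DVC output) in place of the pointer files that DVC would generate. The hash values and commit messages are also made up.

```shell
# Sketch: annotated tags pin different versions of a tracked pointer file.
# data.dvc is a hand-written stand-in (assumption), not a file produced by DVC.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example"

echo "md5: aaaa" > data.dvc            # stand-in pointer for the first dataset version
git add data.dvc
git commit -q -m "Dataset 1.0.0"
git tag -a 1.0.0 -m "Dataset 1.0.0"

echo "md5: bbbb" > data.dvc            # stand-in pointer for the changed dataset
git commit -q -am "Dataset 1.1.0"
git tag -a 1.1.0 -m "Dataset 1.1.0"

# Each tag resolves to its own version of the pointer file.
v1=$(git show 1.0.0:data.dvc)
v2=$(git show 1.1.0:data.dvc)
echo "1.0.0 -> $v1"
echo "1.1.0 -> $v2"
```

In the real repository the pointer files are written by DVC, but the tagging side should behave the same way: each tag fixes which pointer contents, and therefore which data, belong to that version.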
I have multiple projects that each need to reference some version of this dataset. The problem is I can't figure out how to set up those projects to reference a specific version. I'm able to track the HEAD of the repo with something like:
$ cd ~/my_proj # different drive than the remote
$ mkdir data
$ git init
$ dvc init
$ dvc remote add -d local /raid/ml_data # add the remote on my data drive
$ dvc cache dir /raid/ml_data/.dvc/cache # tell DVC to use the remote cache
$ dvc checkout
$ dvc run --external -d /raid/ml_data -o data/ cp -r /raid/ml_data data
This gets me the latest version of the dataset, symlinked into my data folder, but what if I want some projects to use the 1.0.0 version and some to use the 1.1.0 version, or another version? Or, for that matter, what if I update the dataset to 2.0.0 but want my existing projects to keep the version they were set up with instead of tracking HEAD?
It's important to me not to create a ton of local copies of my dataset, as the /home drive is much smaller than the /raid drive and some of these datasets are huge.