
I am using DVC for data version control in machine learning projects. Typically, I switch between data versions by checking out a git branch, commit, or tag to get the appropriate *.dvc files (which record the data checksums), then running dvc checkout to update the data. For example:

git checkout ddc5c395b2afb2b2a626c62ef63a2c7d85382aa6 # to rollback to an old version of *.dvc files
dvc checkout mydata.dvc # to roll `mydata` back to the previous version 

I now want to use DVC to switch between data versions without using git. What I am expecting is something like the following:

dvc checkout mydata.dvc --tag v1.0

Could someone please guide me to use dvc in such a way? Thank you for any help.

TaQuangTu
  • A somewhat related CLI (and API) is `dvc get` - https://dvc.org/doc/command-reference/get#get . E.g this example https://dvc.org/doc/command-reference/get#example-compare-different-versions-of-data-or-model . – Shcheklein May 08 '23 at 14:30
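
The `dvc get` command mentioned in the comment above can be driven from Python with `subprocess`. This is only a minimal sketch under stated assumptions: the helper names `build_dvc_get_command` and `fetch_version` are hypothetical, the `dvc` executable is assumed to be on `PATH`, and `v1.0` is assumed to be a revision (tag, branch, or commit) that exists in the repo. Note that `dvc get` downloads a copy of the data at that revision; it does not switch the workspace copy in place.

```python
import subprocess


def build_dvc_get_command(repo_url, path, rev, out):
    """Build the argv for: dvc get <repo_url> <path> --rev <rev> -o <out>."""
    return ["dvc", "get", repo_url, path, "--rev", rev, "-o", out]


def fetch_version(repo_url, path, rev, out, run=subprocess.run):
    """Download `path` as it existed at `rev` into `out`.

    `run` is injectable (hypothetical convenience) so the command can be
    inspected in tests without actually invoking DVC.
    """
    cmd = build_dvc_get_command(repo_url, path, rev, out)
    return run(cmd, check=True)


# Hypothetical usage: save the v1.0 version of `mydata` next to the current one.
# fetch_version(".", "mydata", "v1.0", "mydata_v1.0")
```

Keeping the argv construction separate from the call makes it easy to log or check the exact command before running it.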

2 Answers


To follow up on @omesser's comment, there are Python packages that allow you to work with a git repo programmatically (without using the git CLI). DVC itself uses both dulwich and pygit2 via scmrepo.

You could actually do what you are looking for directly through DVC's internal API, like this:

from dvc.repo import Repo

dvc = Repo("path/to/your/repo")
dvc.scm.checkout("tags/v1.0")  # git checkout tags/v1.0
dvc.checkout("mydata.dvc")  # dvc checkout mydata.dvc

This only requires installing DVC via pip or conda; it does not require a CLI git installation.

Just note that these APIs aren't publicly documented, so you may need to take a look at the DVC and scmrepo source to see how they work:

https://github.com/iterative/dvc/blob/main/dvc/scm.py
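
The two calls above can be wrapped in a small helper so the version switch becomes a single function call. This is a minimal sketch, not a supported interface: `open_repo` and `checkout_data_version` are hypothetical names, and the only DVC calls used are the undocumented internal ones shown in this answer (`Repo(...)`, `repo.scm.checkout(...)`, `repo.checkout(...)`), which may change between DVC releases.

```python
def open_repo(path):
    """Open a DVC repo at `path`.

    The import is done lazily so this module can be loaded (and the helper
    below tested with a stand-in object) even where DVC is not installed.
    """
    from dvc.repo import Repo  # internal, undocumented API

    return Repo(path)


def checkout_data_version(repo, rev, target):
    """Move the Git workspace to `rev`, then sync `target` to match it."""
    repo.scm.checkout(rev)  # equivalent of: git checkout <rev>
    repo.checkout(target)   # equivalent of: dvc checkout <target>


# Hypothetical usage:
# repo = open_repo("path/to/your/repo")
# checkout_data_version(repo, "tags/v1.0", "mydata.dvc")
```

Passing the repo object in, rather than constructing it inside the helper, lets one `Repo` instance be reused across several version switches.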

pmrowla
  • Looks promising. I will mark this answer later if it works. Thank you, sir. – TaQuangTu May 08 '23 at 09:58
  • Hi @pmrowla, your suggestion is exactly what I am looking for. I appreciate your help, it saved my day. However, I can only upvote this answer, not mark it as accepted, because it does not fully match the ultimate goal of the question. Once again, thank you. – TaQuangTu May 08 '23 at 10:17
  • Hi @pmrowla, what if my DVC remote is located on an SSH server? How should I init and connect to the repo with dvc.repo.Repo? – TaQuangTu May 09 '23 at 06:53

What you're trying to achieve is not possible with DVC alone. As you demonstrated, DVC is designed to bring data management into an existing VCS (like Git or SVN): it manages manifests that are easily version controlled as text, so you can version them together with your code, using the .dvc files as placeholders (indirections) for your actual data.

DVC does not implement a complete Version Control System on its own (it does not create/manage refs/commits or tags).

I wonder why you are trying to break away from Git while still wanting version control functionality. If you are only managing data, for example, it's perfectly acceptable to have a very lightweight git repo containing only DVC artifacts. You would get a small repo and the behavior you want from git+dvc without much "cost".

Then you would have something like the following, instead of the command you listed:

$ git checkout tags/v1.0
$ dvc checkout mydata.dvc

It should even be pretty easy to wrap git+dvc coupled commands in some lightweight wrapper or script if you want to avoid the extra typing.
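
Such a wrapper could look like the following minimal sketch. It is only an illustration: `build_commands` and `checkout_version` are hypothetical names, and it assumes both the `git` and `dvc` executables are on `PATH`.

```python
import subprocess


def build_commands(rev, target):
    """Return the git+dvc command pair that switches `target` to `rev`."""
    return [
        ["git", "checkout", rev],     # bring back the *.dvc files at `rev`
        ["dvc", "checkout", target],  # sync the data with those *.dvc files
    ]


def checkout_version(rev, target, cwd=".", run=subprocess.run):
    """Run both commands in `cwd`, stopping at the first failure.

    `run` is injectable (hypothetical convenience) so the sequence can be
    checked without touching a real repository.
    """
    for cmd in build_commands(rev, target):
        run(cmd, cwd=cwd, check=True)


# Hypothetical usage:
# checkout_version("tags/v1.0", "mydata.dvc", cwd="path/to/your/repo")
```

Because `check=True` raises on a non-zero exit code, `dvc checkout` is not attempted if `git checkout` fails.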

EDIT: This answer relates to using DVC as a CLI only. It seems @TaQuangTu was looking for usage from code and specifically wanted to avoid the Git CLI. pmrowla's answer addresses exactly that.

Adrian Mole
omesser
  • Thank you for the prompt response. The reason I want to use DVC alone is that I want to use it programmatically, for example by running dvc commands with the `subprocess` library in Python to manage data versions. Involving `Git` here makes things more complicated. Any idea how to solve it? – TaQuangTu May 08 '23 at 09:28
    edit: moved this comment into an answer since you can't have formatted code blocks in comments – pmrowla May 08 '23 at 09:46
    Thanks for the answer [@pmrowla](https://stackoverflow.com/users/1538451/pmrowla) - `dvc.scm` to the rescue! His answer seems to be what OP looked for. I'll edit the above to clarify the answer is referring to CLI only – omesser May 08 '23 at 10:54