I want to use the pipeline functionality of dvc in a git repository. The data is managed otherwise and should not be versioned by dvc. The only functionality which is needed is that dvc reproduces the needed steps of the pipeline when dvc repro is called. Checking out the repository on a new system should lead to an 'empty' repository, where none of the pipeline steps are stored.

Thus, if I understand correctly, there is no need to track the dvc.lock file in the repository. However, adding dvc.lock to the .gitignore file leads to an error message:

ERROR: 'dvc.lock' is git-ignored.

Is there any way to disable the check for dvc.lock in .gitignore for this use case?

ppmt

1 Answer

This is definitely possible, as DVC features are loosely coupled to one another. You can do pipelining by writing your dvc.yaml file(s), but avoid data management/versioning by using cache: false in the stage outputs (outs field). See also the helper dvc stage add -O (capital O, short for --outs-no-cache).
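
As a rough illustration of the cache: false approach described above, a dvc.yaml stage might look like the sketch below. The stage name, script, and file paths are hypothetical placeholders, not taken from the question.

```yaml
# dvc.yaml (sketch; stage name, script and paths are made-up examples)
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/clean.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      # cache: false tells DVC not to store this output in its cache;
      # the file still counts as a stage output for dvc repro.
      - data/clean.csv:
          cache: false
```

Note that, as far as I know, DVC does not add outputs marked cache: false to .gitignore on its own, so you would likely need to git-ignore them yourself to keep them out of Git.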

The same goes for initial data dependencies: you can dvc add --no-commit them (ref).

You do want to track dvc.lock in Git though, so that DVC can determine the latest stage of the pipeline associated with the Git commit in every repo copy or branch.

You'll be responsible for placing the right data files/dirs (matching .dvc files and dvc.lock) in the workspace for dvc repro or dvc exp run to behave as expected. dvc checkout won't be able to help you.

Jorge Orpinel Pérez
  • I am not sure if I understand your point correctly. I am fine with DVC saving the different versions of the output files in the local cache and reusing them if suitable. But this should not be tracked in git by any means (e.g., a new checkout of the repo should have no files in the cache). Thus, I would need to ignore the dvc.lock file or leave it as an untracked file forever, right? The different 'versions' of the pipeline are tracked by git as it tracks the source code and the pipeline definition in dvc.yaml anyway. – ppmt Jun 23 '21 at 13:25
  • @ppmt OK. That's the standard usage of DVC then. When you "give" files/dirs to `dvc add` or via dvc.yaml (and `dvc repro`), they get listed in .gitignore automatically so that the data doesn't make it to Git. The DVC cache is local and never up/downloaded unless you set up a DVC remote and `dvc push/pull` explicitly. – Jorge Orpinel Pérez Jun 23 '21 at 18:43
  • You could leave out or remove dvc.lock, but that forces `dvc repro` to always run the pipelines from scratch. But again, that's already the default behavior in new repo clones, even with dvc.lock, as the cache is empty anyway. Also, not having dvc.lock in the Git history would basically make the entire data cache pointless. So I'd keep it, as it's part of the standard DVC repo structure. Think packages.lock for data dependencies/outputs. – Jorge Orpinel Pérez Jun 23 '21 at 18:45
  • Well, the data cache would still be useful for a local copy, because if I switch between two 'versions' of the pipeline, DVC can reload them from the cache. But I do not want to have the dvc.lock file with its hashes in the repository, since the cache is not part of the repo. Thus, I think I will have to live with the untracked dvc.lock file. – ppmt Jun 28 '21 at 12:48
  • No, DVC won't be able to reload pipeline versions from cache without dvc.lock. Again, by not having dvc.lock in your repo, you render the cache basically useless. Like I mentioned, there's no need to remove it: it won't affect you in new repo copies where the cache is empty. Your use case is the base scenario for DVC; no customization is needed. – Jorge Orpinel Pérez Jun 29 '21 at 15:31
  • I do not want to delete it from the local repository. I just want git to ignore it. – ppmt Jun 30 '21 at 07:59
  • You don't need to gitignore it or delete it. Keep it in Git so DVC can work. – Jorge Orpinel Pérez Jul 01 '21 at 08:26