In DVC one may define pipelines. In Unix, one typically does not work at the root level. Further, DVC expects files to be inside the git repository.

So, this seems like a typical problem.

Suppose I have the following:

/home/user/project/content-folder/data/data-type/cfg.json
/home/user/project/content-folder/app/foo.py

Git starts at /home/user/project/

cd ~/project/content-folder/data/data-type
../../app/foo.py do-this --with cfg.json --dest $(pwd) 

Seems reasonable to me: the script takes a configuration, which is stored in a particular location, runs it against some encapsulated functionality, and outputs it to the destination using an absolute path.

The default behavior of --dest is to output to the current working directory. This seems like another reasonable default.


Next, I go to configure the params.yaml file for DVC, and I am immediately confused and unsure what is going to happen. I write:

foodoo:
  params: do-this --with ????/cfg.json --dest ????

What I want to write (and would in a shell script):

#!/usr/bin/env bash
# Resolve the repository root so no path depends on where the project is checked out.
origin=$(git rev-parse --show-toplevel)

verb=do-this
params="--with ${origin}/content-folder/data/data-type/cfg.json --dest ${origin}/content-folder/data/data-type"

But, in DVC, the pathing seems to be implicit, and I do not know where to start as either:

  1. DVC will calculate the path to my script locally
  2. Not calculate the path to my script locally

Which is fine -- I can discover that. But I am reasonably sure that DVC will absolutely not prefix the directory and file params in my params.yaml with the path to my project.


How does one achieve path control that does not assume a fixed project location, as I would in Bash?

Chris

1 Answer

By default, DVC will run your stage command from the same directory as the dvc.yaml file. If you need to run the command from a different location, you can specify an alternate working directory via wdir, which should be a path relative to dvc.yaml's location.

Paths for everything else in your stage (like params.yaml) should be specified as relative to wdir (or relative to dvc.yaml if wdir is not provided).
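
As a rough sketch of how wdir could apply to the layout in the question: the fragment below assumes (hypothetically) that dvc.yaml is placed in content-folder/, and that foo.py is executable and writes to the current directory when --dest is omitted, as the question describes. Nothing in it depends on where the project is checked out.

```yaml
# Hypothetical location: /home/user/project/content-folder/dvc.yaml
stages:
  foodoo:
    # wdir is given relative to the directory that contains dvc.yaml
    wdir: data/data-type
    # cmd runs from wdir, so these paths are relative to data/data-type
    cmd: ../../app/foo.py do-this --with cfg.json
    deps:
      # dependency paths are also relative to wdir
      - ../../app/foo.py
      - cfg.json
```

With this arrangement the stage behaves like running the command after cd-ing into data/data-type, without any absolute path appearing in the file.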

Looking at your example, there also seems to be a bit of confusion about parameters in DVC. In a DVC stage, params is for specifying parameter dependencies, not command-line flags. The full command, including flags and options, should go in the cmd section of your stage. If you wanted to make sure that your stage is rerun every time certain values in cfg.json change, your stage's params section would look something like:

params:
  - <relpath from dvc.yaml>/cfg.json:
      - param1
      - param2
      ...

So your example dvc.yaml would look something like:

stages:
  foodoo:
    cmd: <relpath from dvc.yaml>/foo.py do-this --with <relpath from dvc.yaml>/cfg.json --dest <relpath from dvc.yaml>/...
    deps:
      - <relpath from dvc.yaml>/foo.py
    params:
      - <relpath from dvc.yaml>/cfg.json:
          ...
    ...

With this, dvc repro would rerun your stage whenever the code in foo.py changes or the specified parameters in cfg.json change.

You may also want to refer to the docs for dvc run, which can be used to generate or update a dvc.yaml stage (rather than writing dvc.yaml by hand).

pmrowla
    Great! I got frustrated for a little while, until I figured out that DVC did not make any of the mistakes I expected, and was much more than anticipated. – Chris Dec 23 '20 at 01:32