In my project, I already have some files tracked by DVC that I added with dvc add. Now I want to create stages that use these files as outputs and dependencies, but when I try to create a stage I get an error that says ERROR: output '[FILE NAME]' is already specified in stages.

I assume that dvc add adds the files to the dependency graph as outputs, so including them in a stage creates a conflict, but I couldn't find anything in the official documentation confirming this. So now I am confused about how to add outputs to a stage if these outputs are already tracked by DVC.

Here is an example of the error I get when creating a stage:

>>> dvc stage add -n train -d data/data.csv -o models/model python train.py

ERROR: output 'models/model' is already specified in stages:
        - models/model.dvc
        - train

In this example, the file data/data.csv and the directory models/model have already been added to DVC but not to any stage; however, they are present in the dependency graph.

So how do I include these files in a DVC stage? Is there a way to do it without having to remove the files from DVC and then add them again through a stage?

Ken White
Aymen
  • I don't understand why you thought you had to add *DVC |* to your post title when you used it twice more in the title itself, you tagged it DVC, and you use DVC seven more times in the post body, so I removed it. It's pretty clear that your question is about DVC without the added noise. – Ken White Dec 24 '22 at 03:31
  • @KenWhite My bad, I just saw a previous question with the same format, so I went with it – Aymen Dec 24 '22 at 14:01
  • As a general rule, it's not necessary to use the tag info in the title at all, as the tag system works extremely well. The exceptions are for rare cases where the use in the title clarifies something specific (you're using a new version of something that does not yet have a tag available, so you tag with the current version and mention the new one in the title, for example). SO goes so far as to incorporate the tags in the SEO information, so search engines can use them to help find questions that use them along with the title and content. – Ken White Dec 24 '22 at 14:35
  • @KenWhite Okay, that's noted for future questions! – Aymen Dec 24 '22 at 14:42

1 Answer

DVC tracks stage outputs automatically, so you don't need to run dvc add on them. If you have already done so, you can safely untrack the output with dvc remove first:

Note that the actual output files or directories of the stage (outs field) are not removed by this command, unless the --outs option is used.
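For the example in the question, that would mean running dvc remove models/model.dvc and then repeating the original dvc stage add command. As a rough sketch, assuming the paths and stage name from the question, the resulting dvc.yaml entry should look something like this:

```yaml
# dvc.yaml (sketch): the entry that
#   dvc stage add -n train -d data/data.csv -o models/model python train.py
# is expected to write once models/model.dvc no longer claims the output
stages:
  train:
    cmd: python train.py
    deps:
      - data/data.csv   # may stay tracked by its own .dvc file
    outs:
      - models/model    # now owned by the train stage only
```

data/data.csv appears here only as a dependency, so as far as I know it can stay tracked by its own .dvc file. The error is raised only when the same path is declared as an output in two places.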

One thing to note: when you create a stage and run it, DVC removes the stage's existing outputs (unless a persistence flag is specified). This is done for reproducibility: the stage is expected to produce its outputs every time it runs.
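If an output should survive between runs, the persistence flag mentioned above corresponds, as far as I know, to --outs-persist on dvc stage add, or to persist: true in dvc.yaml. A minimal sketch, reusing the names from the question:

```yaml
# dvc.yaml (sketch): same train stage, but the output is marked persistent,
# so DVC should leave models/model in place when the stage is reproduced
stages:
  train:
    cmd: python train.py
    deps:
      - data/data.csv
    outs:
      - models/model:
          persist: true
```

Without this setting, train.py should be written so that it recreates models/model from scratch on every run.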

Shcheklein
  • Thank you for your answer, I guess I'll just remove them first then. But what do you mean by `When you create a stage and run it, it removes outputs`? Do you mean that they get deleted when I run the stage, or after I have run the stage? If I run the stages with `dvc repro`, for example, and then do a `dvc push`, the outputs should be tracked in my remote, right? – Aymen Dec 24 '22 at 14:07
  • Yes, if you run `dvc repro` + `dvc push` they will be produced, tracked by DVC, and saved. What I was trying to say (maybe it's obvious, but worth making a note, I think) is that when you do `dvc repro` (or `dvc exp run`), existing stage outputs are deleted. – Shcheklein Dec 24 '22 at 18:28