
I have a couple of projects that are using and updating the same data sources. I recently learned about dvc's data registries, which sound like a great way of versioning data across these different projects (e.g. scrapers, computational pipelines).

I have put all of the relevant data into data-registry and then I imported the relevant files into the scraper project with:

$ poetry run dvc import https://github.com/username/data-registry raw

where raw is a directory that stores the scraped data. This seems to have worked properly, but then when I went to build a dvc pipeline that outputted data into a file that was already tracked by dvc, I got an error:

$ dvc run -n menu_items -d src/ -o raw/menu_items/restaurant.jsonl scrapy crawl restaurant
ERROR: Paths for outs:
'raw'('raw.dvc')
'raw/menu_items/restaurant.jsonl'('menu_items')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.

Can someone help me understand what is going on here? What is the best way to use data registries to share and update data across projects?

I would ideally like to update the data-registry with new data from the scraper project and then allow other dependent projects to update their data when they are ready to do so.

Jorge Orpinel Pérez
dino
  • I get a similar but different error message when I `dvc import` the restaurant.jsonl file instead of the entire `raw` directory: "ERROR: output 'restaurant.jsonl' is already specified in stage: 'restaurant.jsonl.dvc'." – dino Feb 28 '21 at 13:13

1 Answer


When you import (or add) something into your project, a .dvc file is created that lists that something (in this case the raw/ dir) as an "output".

DVC doesn't allow overlapping outputs among .dvc files or dvc.yaml stages, meaning that your "menu_items" stage shouldn't write to raw/ since it's already under the control of raw.dvc.

Can you make a separate directory for the pipeline outputs? E.g. use processed/menu_items/restaurant.jsonl
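To make the overlap rule concrete, here is a minimal sketch in Python of the kind of check that triggers the error. This is not DVC's actual implementation; the function name `outputs_overlap` is hypothetical, and the sketch assumes that two outputs "overlap" when one path equals the other or is nested inside it.

```python
from pathlib import PurePosixPath


def outputs_overlap(a: str, b: str) -> bool:
    """Return True if one path equals the other or is nested inside it.

    Hypothetical helper for illustration only; not part of DVC.
    """
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


# 'raw' is already an output of raw.dvc, so a stage writing inside it conflicts.
print(outputs_overlap("raw", "raw/menu_items/restaurant.jsonl"))

# Writing to a separate directory, as suggested above, does not conflict.
print(outputs_overlap("raw", "processed/menu_items/restaurant.jsonl"))
```

Under that assumption, the first call reports a conflict and the second does not, which is why moving the stage output to processed/ avoids the error.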

Jorge Orpinel Pérez
  • Got it. So the "data registry" from DVC isn't necessarily a centralized repo of data, which is how I read the docs. It sounds like the best practice is that each subproject can make its output data available to other projects, which can then import it. Thanks! – dino Mar 01 '21 at 11:10
  • They do intend to help centralize data but projects that *consume* them can't contribute changes back because they're separate DVC projects. You may want to use a shared DVC remote or even DVC cache instead. E.g. https://dvc.org/doc/use-cases/shared-development-server – Jorge Orpinel Pérez Mar 02 '21 at 15:36