I have a couple of projects that use and update the same data sources. I recently learned about DVC's data registries, which sound like a great way to version data across these different projects (e.g. scrapers, computational pipelines).
I have put all of the relevant data into a data-registry repository.
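For reference, the registry side was set up with something along these lines (the exact paths and commit message here are illustrative, not verbatim):

# inside the data-registry repository
$ dvc add raw
$ git add raw.dvc .gitignore
$ git commit -m "Track scraped data"
$ dvc push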
I then imported the relevant files into the scraper project, where raw is the directory that stores the scraped data:
$ poetry run dvc import https://github.com/username/data-registry raw
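This created a raw.dvc import file in the scraper project. From memory it looks roughly like the following (hashes and the pinned revision elided, so take the exact fields with a grain of salt):

$ cat raw.dvc
md5: ...
frozen: true
deps:
- path: raw
  repo:
    url: https://github.com/username/data-registry
    rev_lock: ...
outs:
- md5: ...
  path: raw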
The import seems to have worked properly, but when I then went to build a DVC pipeline that outputs data into a file already tracked by DVC, I got an error:
$ dvc run -n menu_items -d src/ -o raw/menu_items/restaurant.jsonl scrapy crawl restaurant
ERROR: Paths for outs:
'raw'('raw.dvc')
'raw/menu_items/restaurant.jsonl'('menu_items')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.
Can someone help me understand what is going on here? What is the best way to use data registries to share and update data across projects?
I would ideally like to update the data-registry with new data from the scraper project and then allow other dependent projects to pull in that data when they are ready to do so.
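To make the goal concrete, the workflow I am imagining looks something like the sketch below; step 2 is my guess, since I do not know the right way to publish new data from the scraper back to the registry, and manually re-adding it is the part I would like to avoid:

# 1) in the scraper project: run the pipeline to produce new data
$ dvc repro

# 2) somehow get the new data into the data-registry repository
#    (manually re-adding it is the best I can think of)
$ dvc add raw
$ git commit -am "Update scraped data"
$ dvc push

# 3) in each dependent project, when ready: pull the new version of the import
$ dvc update raw.dvc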