How to add existing data via dvc?

Question

I have some data in S3 in an AWS account, and I want to use it in a new machine learning project. To be able to use and track that data via DVC, do I need to download it to my local machine first and then add it via the dvc add command? I understand this will add it to the local cache on my machine, generate a hash, and write it to .dvc files for tracking purposes. As the data already exists on S3, I wouldn't need to do a dvc push after dvc add.

Is my logic right here?

score 1 · Answer 1 · answered May 29 '23 at 09:27

There are two options if you don't want to download the file locally first.

If you don't want to push your data back to a remote, you can use external inputs.

You can do that with dvc add --external (see https://dvc.org/doc/user-guide/data-management/managing-external-data). This works with the data where it already lives in S3 and won't push it to any remote.

You can also check out this question for an example of using it: "dvc add -external S3://mybucket/data.csv" is failing with access error even after giving correct remote cache configurations

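As a rough sketch of the first option, the commands might look like the following. The function name track_external_s3, the remote name s3cache, and the example URLs are placeholders, not anything DVC requires. The --external flag was available in the DVC 2.x releases current when this answer was written; newer releases may not support it, so check dvc add --help for your version.

```shell
# Sketch only: bucket names and paths are hypothetical placeholders.
track_external_s3() {
  data_url="$1"    # e.g. s3://mybucket/data.csv
  cache_url="$2"   # e.g. s3://mybucket/cache

  # Register an S3 location and tell DVC to use it as the external cache for S3.
  dvc remote add s3cache "$cache_url"
  dvc config cache.s3 s3cache

  # Track the S3 object in place instead of a file in the local workspace.
  dvc add --external "$data_url"
}
```

It would be called from inside an initialized DVC project, for example: track_external_s3 s3://mybucket/data.csv s3://mybucket/cache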
If you're OK with pushing your artifact to a remote (it should be a different remote, or a different path in the same remote), you can use dvc import-url (see https://dvc.org/doc/command-reference/import-url).
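
A minimal sketch of the import-url route, assuming an initialized DVC project with S3 credentials already configured. The function name import_s3_data, the remote name myremote, and the example URLs are placeholders chosen for illustration.

```shell
# Sketch only: bucket names and paths are hypothetical placeholders.
import_s3_data() {
  src_url="$1"      # e.g. s3://mybucket/data.csv
  remote_url="$2"   # e.g. s3://mybucket/dvcstore (a different path from the source)

  # Set the default remote that dvc push uploads to.
  dvc remote add -d myremote "$remote_url"

  # Fetch the data into the workspace and record its source URL in a .dvc file.
  dvc import-url "$src_url"

  # Upload the cached copy to the default remote.
  dvc push
}
```

Example call: import_s3_data s3://mybucket/data.csv s3://mybucket/dvcstore. Note that plain dvc import-url does bring the file into the workspace; the command also has a --to-remote option that transfers the data straight to remote storage instead, which is described in the command reference linked above.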

Generally, the latter is preferred because there are fewer mistakes you can make with it. You can check out https://dvc.org/doc/user-guide/data-management/managing-external-data for more of the motivation behind this recommendation.

Thanks. I will go through the links you sent. Over the weekend, I tried DVC and was able to go through a few examples. One thing I was confused about: I set the remote via dvc remote add -d myremote s3://bucket and set up the cache as well, with dvc remote add s3cache s3://bucket/cache and dvc config cache.s3 s3cache. When I do dvc add and dvc push, I don't see anything in the cache bucket. I see content in my bucket but not in my cache folder in S3. Isn't DVC supposed to push things to the cache folder as well? Or how should this work? — haju, May 30 '23 at 03:29
