
I have some data in S3 in an AWS account that I want to use in a new machine learning project. To use that data and track it with DVC, do I need to download it to my local machine first and then add it with the dvc add command? I understand this will add it to the local cache on my machine, generate a hash, and write it to .dvc files for tracking purposes. As the data already exists on S3, I wouldn't need to do a dvc push after dvc add.

Is my logic right here?

desertnaut
haju

1 Answer


There are two options if you don't want to download the file locally first.

  1. If you don't want to push your data back to a remote, you can use external inputs.

    You can do that with dvc add --external (see https://dvc.org/doc/user-guide/data-management/managing-external-data). This works with the data where it already lives in your remote storage and won't push it back to any remote.

    You can also check out this question for an example of using it: "dvc add -external S3://mybucket/data.csv" is failing with access error even after giving correct remote cache configurations

  2. If you're OK with pushing your artifact back to a remote (it should be a different remote, or a different path in the same remote), you can use dvc import-url (see https://dvc.org/doc/command-reference/import-url).

Generally, the latter is preferred because it leaves less room for mistakes. You can check out https://dvc.org/doc/user-guide/data-management/managing-external-data for more of the motivation behind this recommendation.
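
As a rough illustration of the second option, here is a minimal sketch wrapped in a small shell function. It assumes you are inside a Git repository where dvc init has already been run and that your AWS credentials are configured. The function name track_s3_data, the remote name storage, and the example bucket paths are all placeholders of mine, not anything DVC requires.

```shell
#!/bin/sh
# Sketch only: track an existing S3 object with `dvc import-url`.
# Assumes a Git repo already initialised with `dvc init`.
# The remote name "storage" and all bucket paths are hypothetical placeholders.

track_s3_data() {
    src_url="$1"      # where the data already lives, e.g. s3://mybucket/data.csv
    remote_url="$2"   # separate location for DVC storage, e.g. s3://mybucket/dvcstore

    # Set a default DVC remote that is distinct from the source location.
    dvc remote add -d storage "$remote_url" || return 1

    # Download the object into the workspace and record its source URL
    # and checksum in <name>.dvc.
    dvc import-url "$src_url" || return 1

    # Stage the .dvc file, the .gitignore entry DVC adds, and the remote config.
    dvc_file="$(basename "$src_url").dvc"
    git add "$dvc_file" .gitignore .dvc/config || return 1

    # Upload the cached copy to the default remote.
    dvc push
}

# Example call (placeholder values):
# track_s3_data s3://mybucket/data.csv s3://mybucket/dvcstore
```

After this you would commit the staged files with Git as usual. If the source object changes later, dvc update on the .dvc file should pick up the new version.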

  • Thanks, I will go through the links you sent. Over the weekend I tried DVC and was able to go through a few examples. One thing I was confused about: I set the remote via dvc remote add -d myremote s3://bucket and set up a cache as well, with dvc remote add s3cache s3://bucket/cache and dvc config cache.s3 s3cache. When I do dvc add and dvc push, I don't see anything in the cache folder. I see content in my bucket but not in my cache folder in S3. Isn't DVC supposed to push things to the cache folder as well? Or how should this work? – haju May 30 '23 at 03:29