
I am new to DVC, and so far I like what I see. But possibly my question is fairly easy to answer.

My question: how do we correctly track dependencies on files in an original huge-data repository (let's assume this can also change) from a derivedData project, but WITHOUT the huge files generally being imported when the derived data is checked out? I don't think I can use dvc import to achieve this.

Details: We have a repository with a large number of quite big data files (scans) and use this data to design and train various algorithms. Often we want to use only specific files, and even only small chunks from within the files, for training, annotation and so on. That is, we derive data for specific tasks, which we want to put in new repositories.

Currently my idea is to dvc get the relevant data, put it in an untracked temporary folder, and then manage the derived data with DVC again, while still recording the dependency on the original data.

hugeFileRepo
 +-- metaData.csv
 +-- dataFolder
      +-- hugeFile_1
      ...
      +-- hugeFile_n

In the derivedData repository I do:

 dvc import hugeFileRepo.git metaData.csv
 dvc run -f derivedData.dvc \
    -d metaData.csv \
    -d deriveData.py \
    -o derivedDataFolder \
    python deriveData.py 

My deriveData.py does something along these lines:

import csv
import subprocess
import yaml  # PyYAML, to parse the .dvc file written by `dvc import`

# Hack because I don't know how to do it right: read the pinned git
# revision (rev_lock) out of the .dvc file that `dvc import` created.
with open("metaData.csv.dvc") as f:
    gitRevision = yaml.safe_load(f)["deps"][0]["repo"]["rev_lock"]

with open("metaData.csv", newline="") as f:
    for metaDataForFile in csv.DictReader(f):
        file = metaDataForFile["file"]  # column holding the file path
        if iWantFile(metaDataForFile):  # my own selection logic
            # download only this specific file at the pinned revision
            subprocess.run(
                ["dvc", "get", "--rev", gitRevision,
                 "-o", f"tempFolder/{file}", "hugeFileRepo.git", file],
                check=True,
            )
            # process the huge file; store the result in derivedDataFolder
            processAndWrite(f"tempFolder/{file}")
So I use the metaData file as a proxy for the actual data. The hugeFileRepo data will not change frequently, and the metaData file will be kept up to date. I am absolutely fine with having a dependency on the data in general rather than on the actual files I used. So I believe this solution would work for me, but I am sure there is a better way.

tePer

1 Answer


This is not a very specific answer because I'm not sure I understand the details and setup completely, but in general here are some ideas:

We have a repository with a large number of quite big data files (scans)... Often we want to use only specific files, and even only small chunks

DVC commands that accept a target data file should support granularity (see https://github.com/iterative/dvc/issues/2458), meaning you can import only specific files from a tracked directory. As for chunks, there's no way for DVC to import only certain parts of files from the CLI; that would require semantically understanding every possible data format.
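
For example, something like this should work with the Python API (an untested sketch; it assumes dataFolder is tracked as a single directory in hugeFileRepo, and the repo URL is a placeholder):

    import dvc.api

    # Fetch one file from inside the tracked dataFolder directory;
    # the other huge files in that directory are never downloaded.
    content = dvc.api.read(
        "dataFolder/hugeFile_1",   # granular target inside the directory
        repo="hugeFileRepo.git",   # placeholder: URL or path of the repo
        mode="rb",                 # the scans are presumably binary
    )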

we derive data for specific tasks

Looking at this step as a proper DVC stage (derivedData.dvc) seems like the best approach here, and this stage depends on the full original data files (again, there's no way for DVC to know in advance which parts of the data the source code will actually use).

Since you're using Python though, there's an API to open and stream data from online DVC repositories directly into your program at runtime, so deriveData.py could do that without needing to import or download anything beforehand. See https://dvc.org/doc/api-reference
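
For example, something along these lines (again an untested sketch; the rev value and the process function are placeholders for your own choices):

    import dvc.api

    # Stream a big file straight from hugeFileRepo at a given revision,
    # without a prior `dvc import`/`dvc get` and without a local copy.
    with dvc.api.open(
        "dataFolder/hugeFile_1",
        repo="hugeFileRepo.git",   # placeholder: URL or path of the repo
        rev="master",              # any git revision: branch, tag, commit
        mode="rb",
    ) as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
            process(chunk)         # placeholder for your processing code

This way the temporary folder and the dvc get step in your script wouldn't be needed at all.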


Sorry, I don't think I understand the intention of the last code sample (where git revisions are being used) or its relationship to the main question.

Jorge Orpinel Pérez